Abstract



In cancer research, profiling studies have been extensively conducted, searching for genes/SNPs 
associated with prognosis. Cancer is a heterogeneous disease. Examining the similarity and difference
in the genetic basis of multiple subtypes of the same cancer can lead to better understanding
of their connections and distinctions. Classic meta-analysis approaches analyze each subtype
separately and then compare analysis results across subtypes. Integrative analysis approaches, 
in contrast, analyze the raw data on multiple subtypes simultaneously and can outperform meta- 
analysis. In this study, prognosis data on multiple subtypes of the same cancer are analyzed. 
An AFT (accelerated failure time) model is adopted to describe survival. The genetic basis of 
multiple subtypes is described using the heterogeneity model, which allows a gene/SNP to be 
associated with the prognosis of some subtypes but not the others. A compound penalization
approach is developed to conduct gene-level analysis and identify genes that contain important
SNPs associated with prognosis. The proposed approach has an intuitive formulation and
can be realized using an iterative algorithm. Asymptotic properties are rigorously established.
Simulation shows that the proposed approach has satisfactory performance and outperforms 
meta-analysis using penalization. An NHL (non-Hodgkin lymphoma) prognosis study with 
SNP measurements is analyzed. Genes associated with the three major subtypes, namely DL- 
BCL, FL, and CLL/SLL, are identified. The proposed approach identifies genes different from 
alternative analysis and has reasonable prediction performance. 

Keywords: Cancer prognosis; Integrative analysis; Marker selection; Penalization. 

1 Introduction 

Profiling studies have been extensively conducted in cancer research, searching for SNPs (single 
nucleotide polymorphisms) and genes that are associated with prognosis. Cancer is a heterogeneous 
disease. Different subtypes of the same cancer usually have different prognosis patterns and different
associated genes/SNPs. Consider NHL (non-Hodgkin lymphoma), which is a heterogeneous group 
of malignancies ranging from very indolent forms to aggressive ones. As discussed in Zhang et al. 
(2011), different subtypes of NHL are largely different. For example, DLBCL, the largest subtype, 
is aggressive, whereas FL, the second largest subtype, is indolent. Chromosomal translocations 
such as t(3;22) are specific to DLBCL, whereas others such as t(14;18) are specific to FL. On the
other hand, different subtypes may also share common susceptibility genes/SNPs. Genes in the 
cell cycle, multiple signaling, RAS, and DNA repair pathways are involved in the development and 
progression of multiple cancers including NHL. For NHL, Han et al. (2010) and Ma et al. (2010)
have found that SNPs in multiple genes, such as BRCA2, CASP3, IRF1, BCL2, NAT2, and ALOX12B,
are associated with both DLBCL and FL. Investigating the similarity and difference in the genetic 
basis of multiple subtypes of the same cancer can lead to better understanding of the connections 
and distinctions among subtypes (Rhodes et al. 2004; Goh and Choi 2012). 

A comparable setting where the discussion and proposed method are relevant is the analysis of 
prognosis data on multiple types of cancers. As discussed in Rhodes et al. (2004) and follow-up
studies, susceptibility genes shared by multiple types of cancers are more likely to represent the 
more essential features of cancer, whereas cancer type-specific genes determine the distinctions 
among different cancers. 

When multiple subtypes of the same cancer are of interest, as discussed in Zhang et al. (2011), 
most studies analyze each subtype separately and then compare results across subtypes. For NHL, 
Han et al. (2010) and Ma et al. (2010) take this approach. Such a strategy fits in the classic 
meta-analysis framework. With high-dimensional measurements such as SNPs, data on individual 
subtypes have the "large d, small n" characteristic, with the sample size n much smaller than the 
number of SNPs d. Because of the low sample size, susceptibility genes/SNPs identified from the 
analysis of each subtype may have unsatisfactory properties. Recent studies have shown that, when
multiple datasets (multiple subtypes in this study) have overlapping susceptibility SNPs/genes,
integrative analysis can analyze raw data of multiple datasets simultaneously and generate improved 
analysis results over the analysis of individual datasets and meta-analysis (Liu et al. 2012; Ma et 
al. 2009; Ma et al. 2012). 

With data on multiple subtypes, the goal is to identify genes associated with prognosis. For 
marker identification, we adopt penalization, which has been extensively applied to the analysis 
of cancer prognosis data with high-dimensional genetic measurements. Single-dataset penalization 
methods, such as Lasso, SCAD, bridge, MCP and their group counterparts, cannot be directly ap- 
plied to the analysis of multiple datasets. With multiple datasets, the homogeneity model assumes 
that, if a functional unit (gene or SNP) is identified, it is concluded as associated with prognosis 
in all datasets (Liu et al. 2012). An alternative to the homogeneity model is the heterogeneity 
model, under which a gene or SNP can be associated with prognosis in some datasets but not the 
others. Under the heterogeneity model, research on penalization methods has been limited (Liu et 
al. 2012). Compared with the existing studies which analyze gene expression data, the present one 
has additional complexity. One gene may consist of multiple SNPs, and it is important to allow 
different effects for SNPs within the same gene. In addition, theoretical properties of the methods 
developed in Liu et al. (2012) and others have not been established. To the best of our knowledge, 
the only available method that is tailored to the type of data analyzed in this study is Ma et al. 
(2012), which adopts thresholding for marker selection. The thresholding method does not have a 
well-defined objective function. Thus its properties can be very difficult to establish. In addition, 
it may have more tuning parameters than the penalization method. Compared with the existing 
studies, another advancement of this study is the analysis of prognosis data on NHL, which may
provide insights into the genetic basis of this deadly disease. 

The integrative analysis of data on multiple subtypes of cancer can be challenging. With some 
cancers, the subtype information may be only partial or even wrong. In addition, the definitions
of subtypes are still evolving. For NHL subtypes, we refer to Zhang et al. (2011) and references
therein for relevant discussions. When there are a large number of subtypes, the set of subtypes 
chosen for analysis needs to be jointly determined by the scientific question of interest, quality 
of data, sample size, evidence from epidemiologic studies and other factors. We acknowledge the 
importance and difficulty of these issues. In this study, we focus on the development of a new 
analysis approach and refer to other publications for relevant discussions. 

2 Integrative Analysis under the Heterogeneity Model 

2.1 Data and model settings 

Assume that there are data on $M$ subtypes of the same cancer, and there are $n_m$ iid observations
for subtype $m$ $(= 1, \ldots, M)$. The total sample size is $n = \sum_m n_m$. For subtype $m$, denote $T^m$ as
the logarithm of failure time. Denote $X_o^m$ as the length-$d$ covariate vector (SNPs in this study).
The subscript "o" is used to discriminate the original (versus weighted) covariates. For simplicity
of notation, assume that different subtypes measure the same set of covariates. In practice, if a
covariate is not measured for a specific subtype, its corresponding regression coefficient will be set
as zero. In penalization, rescaling is used to accommodate partially matched covariate sets.
For subject $i$ of subtype $m$, the AFT (accelerated failure time) model assumes that

$$T_i^m = \beta_0^m + X_{o,i}^{m\prime}\beta^m + \epsilon_i^m, \quad i = 1, \ldots, n_m, \tag{1}$$

where $\beta_0^m$ is the intercept, $\beta^m \in \mathbb{R}^d$ is the regression coefficient, and $\epsilon_i^m$ is the error term. As
$T_i^m$ is subject to right censoring, we observe $(Y_i^m, \delta_i^m, X_{o,i}^m)$, where $Y_i^m = \min\{T_i^m, C_i^m\}$, $C_i^m$ is the
logarithm of censoring time, and $\delta_i^m = I\{T_i^m \le C_i^m\}$ is the event indicator.

Compared with alternatives such as the Cox model, the AFT model has a much simpler ob- 
jective function, as demonstrated in the next section, and hence significantly lower computational 
cost. Such a property is particularly desirable for high-dimensional data. In addition, it directly
describes event times, and its regression coefficients may have more lucid interpretations than those
in alternative models. As there is a lack of model diagnostics tools for high-dimensional data,
alternative models will not be discussed. 

2.2 Weighted least squares estimation 

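Let $\hat{F}^m$ be the Kaplan-Meier estimator of the distribution function $F^m$ of $T^m$. Following Stute
(1996), $\hat{F}^m$ can be written as $\hat{F}^m(y) = \sum_{i=1}^{n_m} w_i^m I\{Y_{(i)}^m \le y\}$, where the $w_i^m$'s are the jumps in the
Kaplan-Meier estimator and can be expressed as

$$w_1^m = \frac{\delta_{(1)}^m}{n_m}, \qquad w_i^m = \frac{\delta_{(i)}^m}{n_m - i + 1} \prod_{j=1}^{i-1} \left(\frac{n_m - j}{n_m - j + 1}\right)^{\delta_{(j)}^m}, \quad i = 2, \ldots, n_m.$$

Here $Y_{(1)}^m \le \cdots \le Y_{(n_m)}^m$ are the order statistics of the $Y_i^m$'s, and $\delta_{(1)}^m, \ldots, \delta_{(n_m)}^m$ are the associated event
indicators. Similarly, let $X_{o(1)}^m, \ldots, X_{o(n_m)}^m$ be the associated covariate vectors of the ordered $Y_i^m$'s.
Consider the weighted least squares objective function

$$L^m(\beta_0^m, \beta^m) = \frac{1}{2} \sum_{i=1}^{n_m} w_i^m \left(Y_{(i)}^m - \beta_0^m - X_{o(i)}^{m\prime}\beta^m\right)^2. \tag{2}$$

Let $\bar{X}_w^m = \sum_{i=1}^{n_m} w_i^m X_{o(i)}^m / \sum_{i=1}^{n_m} w_i^m$, $\bar{Y}_w^m = \sum_{i=1}^{n_m} w_i^m Y_{(i)}^m / \sum_{i=1}^{n_m} w_i^m$, $X_{w(i)}^m = (w_i^m)^{1/2}(X_{o(i)}^m - \bar{X}_w^m)$,
and $Y_{w(i)}^m = (w_i^m)^{1/2}(Y_{(i)}^m - \bar{Y}_w^m)$. Using the weighted centered values, the intercept is zero.
The weighted least squares objective function can be written as

$$L^m(\beta^m) = \frac{1}{2} \sum_{i=1}^{n_m} \left(Y_{w(i)}^m - X_{w(i)}^{m\prime}\beta^m\right)^2. \tag{3}$$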

This simple form makes computation affordable even with high-dimensional data. 
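To make the weighting step concrete, the following is a minimal sketch of the Kaplan-Meier weights and the weighted centering for a single subtype. It assumes the standard Stute form of the weights, breaks ties by input order, and assumes at least one uncensored observation; the function names `km_weights` and `weighted_center` are illustrative and not from the original study.

```python
import math

def km_weights(y, delta):
    """Return (order, weights): indices that sort y, and the Kaplan-Meier jump at each rank."""
    n = len(y)
    order = sorted(range(n), key=lambda i: y[i])
    weights = []
    surv = 1.0  # product of ((n - j) / (n - j + 1)) ** delta_(j) over earlier ranks j
    for rank, idx in enumerate(order, start=1):
        weights.append(surv * delta[idx] / (n - rank + 1))
        if delta[idx] == 1:
            surv *= (n - rank) / (n - rank + 1)
    return order, weights

def weighted_center(y, X, delta):
    """Center by the weighted means and scale by sqrt(weight); rows are in sorted-time order."""
    order, w = km_weights(y, delta)
    total = sum(w)  # assumes at least one event, so total > 0
    p = len(X[0])
    y_bar = sum(wi * y[i] for wi, i in zip(w, order)) / total
    x_bar = [sum(wi * X[i][k] for wi, i in zip(w, order)) / total for k in range(p)]
    y_w = [math.sqrt(wi) * (y[i] - y_bar) for wi, i in zip(w, order)]
    X_w = [[math.sqrt(wi) * (X[i][k] - x_bar[k]) for k in range(p)]
           for wi, i in zip(w, order)]
    return y_w, X_w
```

With log times (1, 2, 3) and event indicators (1, 0, 1), the weights are 1/3, 0, and 2/3: the censored observation gets zero weight and its mass is passed on to the later event. After this transformation, ordinary least squares on the transformed values minimizes the weighted objective.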

Assume independence between data for the $M$ subtypes. Consider the overall objective function

$$L(\beta) = \sum_{m=1}^{M} \frac{1}{n_m} L^m(\beta^m).$$

Here we normalize $L^m$ by $n_m$ so that the analysis is not dominated by large subtypes. When larger
subtypes are of more interest, the unnormalized objective function may be considered.



2.3 Heterogeneity model 

As formulated in Liu et al. (2012), two different models, namely the homogeneity model and 
heterogeneity model, can be applied to describe the genetic basis of multiple subtypes. Under the 
homogeneity model, it is postulated that multiple subtypes share the same set of susceptibility 
SNPs/genes. Considering the significantly different prognosis patterns of different subtypes, this 
model may be too restricted. Under the heterogeneity model, the sets of susceptibility SNPs/genes 
may differ across subtypes. The heterogeneity model includes the homogeneity model as a special 
case and can be more flexible. 

To more explicitly describe the data and model settings, heterogeneity model, and our analysis 
strategy, consider a hypothetical cancer study with three subtypes and eight SNPs representing 
four genes (Table 1). Gene 1 is associated with the prognosis of all three subtypes; Gene 2 is 
associated with the first two subtypes but not the third one; Gene 3 is associated with the third 
subtype only; and Gene 4 is not associated with any subtype. In Table 1, we show the regression
coefficient matrix whose main characteristics reflect the essence of integrative analysis under the 
heterogeneity model. Unimportant genes/SNPs not associated with prognosis have no effects and 
so zero regression coefficients. With penalization approaches including the proposed one, marker 
identification amounts to identifying the sparsity structure of models. For an important gene/SNP 
(for example SNP 1_1), its strengths of association with multiple subtypes, which are measured 
with regression coefficients, may be different for different subtypes. With SNP data, analysis can 
be conducted at multiple levels, particularly including SNP-level and gene-level. In this study, 
we conduct gene-level analysis, which complements SNP-level analysis. As the goal is to identify
important genes that contain prognosis-associated SNPs, within an important gene, no further 
selection is conducted. Thus, SNPs within the same gene have the "all in or all out" property. 
Such a strategy has been adopted in Ma et al. (2012) and is different from that for SNP-level 
analysis approaches. 

3 Penalized Marker Selection

Penalization is adopted for marker selection. Under the data and model settings described in the 
previous section, existing penalization approaches are not directly applicable. A new penalization 
approach is described in Section 3.1. An effective computational algorithm is proposed in Section
3.2. Tuning parameter selection is discussed in Section 3.3. Some practical concerns are discussed
in Section 3.4. Asymptotic properties and proofs are established in the Appendix.

3.1 Penalty function 

Assume that the $d$ SNPs belong to $J$ genes. To accommodate the scenario with partially matched
gene sets, suppose that $M_j$ subtypes (studies) measure gene $j$. Without loss of generality, assume
that gene $j$ is measured in the first $M_j$ subtypes. Denote $d_{jk}$ as the number of SNPs in the $j$th
gene and $k$th subtype with coefficient vector $\beta_{jk} = (\beta_{jk}^{1}, \ldots, \beta_{jk}^{d_{jk}})'$, $j = 1, \ldots, J$, $k = 1, \ldots, M_j$.
The subscript $k$ is kept to accommodate partially matched SNP sets for the same genes. Then
$\beta_j = (\beta_{j1}', \ldots, \beta_{jM_j}')'$ is the regression coefficient for all SNPs in the $j$th gene across all subtypes.
Here the notations are slightly more complicated than those in Section 2 to accommodate the
"SNP-within-gene" hierarchical structure and partially matched SNP/gene sets. Let $\beta = (\beta_1', \ldots, \beta_J')'$.
Consider the penalized estimate

$$\hat{\beta} = \arg\min_{\beta} \left\{ L(\beta) + P_{\lambda_n,\gamma}(\beta) \right\}.$$

A nonzero component of $\hat{\beta}$ indicates an association between the corresponding gene (SNP) and
subtype's prognosis. Consider the penalty function

$$P_{\lambda_n,\gamma}(\beta) = \lambda_n \sum_{j=1}^{J} c_j \left( \sum_{k=1}^{M_j} \sqrt{d_{jk}}\, \|\beta_{jk}\| \right)^{\gamma},$$

where $\lambda_n > 0$ is a data-dependent tuning parameter, $c_j \propto M_j^{1-\gamma}$ is a constant accommodating
partially matched gene sets, $\|\cdot\|$ is the $L_2$ norm, and $0 < \gamma < 1$ is the fixed bridge parameter.
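As a small illustration, the penalty can be evaluated directly from a nested list of coefficients. This is a sketch under stated assumptions: the inner sum is taken to be $\sum_k \sqrt{d_{jk}}\|\beta_{jk}\|$, the proportionality constant in $c_j$ is set to 1 so that $c_j = M_j^{1-\gamma}$, and the name `compound_penalty` is hypothetical.

```python
import math

def compound_penalty(beta, lam, gamma):
    """Bridge-over-group-norm penalty.

    beta[j][k] is the list of SNP coefficients of gene j in subtype k,
    for the subtypes that measure gene j.
    """
    total = 0.0
    for gene in beta:
        m_j = len(gene)                      # number of subtypes measuring this gene
        c_j = m_j ** (1.0 - gamma)           # assumption: proportionality constant 1
        inner = sum(
            math.sqrt(len(block)) * math.sqrt(sum(v * v for v in block))
            for block in gene
        )                                    # Lasso-type sum of scaled L2 norms
        total += c_j * inner ** gamma        # bridge power applied per gene
    return lam * total

# One gene with a single measured subtype and two SNPs, plus one all-zero gene.
value = compound_penalty([[[3.0, 4.0]], [[0.0], [0.0]]], lam=2.0, gamma=0.5)
```

A gene whose coefficient blocks are all zero contributes nothing, while the inner sum of unsquared norms allows individual subtype blocks within a selected gene to be set to zero.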



The above penalty has been designed to tailor the special model characteristics as described 
in Table 1. In our analysis, genes are the basic functional units. The penalty is the sum of J 
individual terms, with one for each gene. For a specific gene, two levels of selection need to be 
conducted. The first is to determine whether it is associated with any subtype at all. This step 
of selection is achieved using a bridge penalty. For a gene associated with at least one subtype, 
the second level of selection is to determine which subtype(s) it is associated with. This step of 
selection is achieved using a Lasso-type penalty. The composition of the bridge-type penalty and 
the Lasso-type penalty can achieve the desired two-level selection. One gene may contain multiple 
SNPs. The effect of gene $j$ for subtype $k$ is represented by the vector $\beta_{jk}$. Here the penalty is
imposed on the $L_2$ norm of $\beta_{jk}$, which can be viewed as the square root of a ridge penalty. Thus,
within a selected gene, no further SNP-level selection is conducted. If a gene is selected, all SNPs 
within this gene are selected. 

When $d_{jk} = 1$ (one SNP per gene), the proposed penalty becomes the group bridge penalty, which has been
developed for the analysis of a single dataset in Huang et al. (2009). This study is among the 
first to apply group bridge type penalization in integrative analysis. In addition, the proposed 
penalized estimation can be more complicated than that in Huang et al. (2009) by accommodating 
the "SNP-within-gene" hierarchical structure. Using composite penalization for marker selection 
under the heterogeneity model has been proposed in Liu et al. (2012) for diagnosis studies with 
binary responses. The penalty in Liu et al. (2012) is built on the composition of MCP and 
Lasso, which is computationally more expensive, and cannot accommodate the "SNP-within-gene" 
structure. 

3.2 Computational algorithm 

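For subtype $m$, denote $Y^m$ as the vector composed of the $\frac{1}{\sqrt{n_m}} Y_{w(i)}^m$'s, and $X^m$ as the matrix composed
of the $\frac{1}{\sqrt{n_m}} X_{w(i)}^{m\prime}$'s. Let $Y = (Y^{1\prime}, \ldots, Y^{M\prime})'$ and $X = \mathrm{diag}(X^1, \ldots, X^M)$. Denote $X_j$ as the submatrix of
$X$ corresponding to $\beta_j$, and its dimension is $n \times \sum_{k=1}^{M_j} d_{jk}$. Then

$$L(\beta) = \sum_{m=1}^{M} \frac{1}{2 n_m} \sum_{i=1}^{n_m} \left(Y_{w(i)}^m - X_{w(i)}^{m\prime}\beta^m\right)^2 = \frac{1}{2} \Big\| Y - \sum_{j=1}^{J} X_j \beta_j \Big\|^2.$$

The overall objective function is

$$\frac{1}{2} \Big\| Y - \sum_{j=1}^{J} X_j \beta_j \Big\|^2 + \lambda_n \sum_{j=1}^{J} c_j \left( \sum_{k=1}^{M_j} \sqrt{d_{jk}}\, \|\beta_{jk}\| \right)^{\gamma}. \tag{6}$$

Define

$$S(\beta, \theta) = \frac{1}{2} \Big\| Y - \sum_{j=1}^{J} X_j \beta_j \Big\|^2 + \sum_{j=1}^{J} \theta_j^{1 - 1/\gamma} c_j^{1/\gamma} \sum_{k=1}^{M_j} \sqrt{d_{jk}}\, \|\beta_{jk}\| + \tau_n \sum_{j=1}^{J} \theta_j, \tag{7}$$

where $\theta = (\theta_1, \ldots, \theta_J)'$, and $\tau_n$ is a penalty parameter.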

Proposition 3.1. If $\lambda_n = \tau_n^{1-\gamma} \gamma^{-\gamma} (1 - \gamma)^{\gamma - 1}$, then $\hat{\beta}$ minimizes the objective function in (6) if
and only if $(\hat{\beta}, \hat{\theta})$ minimizes $S(\beta, \theta)$ subject to $\theta_j \ge 0$ for all $j$.

Proof is provided in the Appendix. Examining $S(\beta, \theta)$ suggests that optimization with respect to
$\beta$ and $\theta$ can be conducted "separately". Optimization with respect to $\theta$ has a simple analytic
solution. Optimization with respect to $\beta$ has a weighted group Lasso type objective function, for
which there exist effective algorithms. Motivated by such an observation, consider the following
algorithm:

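1. Denote $\beta^{(0)}$ as the initial estimate. In our limited numerical study, the choice of initial
estimate does not have much impact on the final estimate. For simplicity, all components of
$\beta^{(0)}$ are set to be 1. Set $s = 0$;

2. $s = s + 1$. Compute

$$\theta_j^{(s)} = c_j \left( \frac{1 - \gamma}{\gamma \tau_n} \right)^{\gamma} \left( \sum_{k=1}^{M_j} \sqrt{d_{jk}}\, \|\beta_{jk}^{(s-1)}\| \right)^{\gamma}, \quad j = 1, \ldots, J, \tag{8}$$

$$\beta^{(s)} = \arg\min_{\beta} \left\{ \frac{1}{2} \Big\| Y - \sum_{j=1}^{J} X_j \beta_j \Big\|^2 + \sum_{j=1}^{J} \big(\theta_j^{(s)}\big)^{1 - 1/\gamma} c_j^{1/\gamma} \sum_{k=1}^{M_j} \sqrt{d_{jk}}\, \|\beta_{jk}\| \right\}; \tag{9}$$

3. Repeat Step 2 until convergence.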
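The closed form in (8) and the relation between $\lambda_n$ and $\tau_n$ in Proposition 3.1 can be checked with a short calculation. The sketch below assumes that $S(\beta, \theta)$ has the usual group bridge form of Huang et al. (2009), namely $\frac{1}{2}\|Y - \sum_j X_j\beta_j\|^2 + \sum_j \theta_j^{1-1/\gamma} c_j^{1/\gamma} B_j + \tau_n \sum_j \theta_j$, and writes $B_j = \sum_{k=1}^{M_j} \sqrt{d_{jk}}\,\|\beta_{jk}\|$ as shorthand (a symbol introduced here for illustration only). It is not the proof given in the Appendix.

```latex
\begin{aligned}
\frac{\partial S}{\partial \theta_j}
  &= \Big(1 - \tfrac{1}{\gamma}\Big)\,\theta_j^{-1/\gamma}\, c_j^{1/\gamma} B_j + \tau_n = 0
  \;\Longrightarrow\;
  \theta_j = c_j \Big(\frac{1-\gamma}{\gamma\,\tau_n}\Big)^{\gamma} B_j^{\gamma}, \\[4pt]
\theta_j^{1-1/\gamma} c_j^{1/\gamma} B_j + \tau_n \theta_j
  &= \frac{\gamma\,\tau_n}{1-\gamma}\,\theta_j + \tau_n \theta_j
   = \frac{\tau_n}{1-\gamma}\,\theta_j \\
  &= \tau_n^{1-\gamma}\,\gamma^{-\gamma}\,(1-\gamma)^{\gamma-1}\; c_j B_j^{\gamma}
   = \lambda_n\, c_j B_j^{\gamma}.
\end{aligned}
```

Plugging the minimizing $\theta_j$ back in therefore recovers the $j$th term of the bridge-type penalty, which is why alternating between $\theta$ and $\beta$ targets the original objective.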

In numerical study, we use the $L_2$ norm of the difference between two consecutive estimates less
than 0.001 as the convergence criterion. The proposed algorithm always converges, since at each
step, the nonnegative objective function decreases. It is noted that as the group bridge type penalty
is not convex, the algorithm may converge to a local minimizer depending on the initial value $\beta^{(0)}$.
Using the proposed initial value works well in our numerical study. The main computational task
is the computation of $\beta^{(s)}$, which is a group Lasso type estimate and can be achieved using a group
coordinate descent algorithm (Huang et al. 2012; Liu et al. 2012). Convergence of the group
coordinate descent algorithm can be derived following Tseng (2001).
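The iteration can be sketched in a few lines of plain Python. Several choices here are assumptions made for illustration rather than the authors' implementation: the inner weighted group Lasso problem is solved by proximal gradient descent with a conservative step size instead of group coordinate descent, $c_j$ is taken as $M_j^{1-\gamma}$, $\tau_n$ is recovered from $\lambda_n$ through the relation in Proposition 3.1, and the names `fit_compound_bridge`, `weighted_group_lasso`, and `group_soft_threshold` are hypothetical.

```python
import math

def group_soft_threshold(z, t):
    """Shrink the vector z toward zero by t in L2 norm (group Lasso proximal map)."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= t:
        return [0.0] * len(z)
    return [(1.0 - t / norm) * v for v in z]

def weighted_group_lasso(X, y, blocks, weights, beta, n_iter=200):
    """Proximal gradient for 0.5 * ||y - X beta||^2 + sum_b weights[b] * ||beta_b||."""
    n, p = len(X), len(X[0])
    step = 1.0 / sum(v * v for row in X for v in row)  # 1 / ||X||_F^2, a safe step size
    beta = list(beta)
    for _ in range(n_iter):
        resid = [sum(X[i][k] * beta[k] for k in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][k] * resid[i] for i in range(n)) for k in range(p)]
        for cols, w in zip(blocks, weights):
            z = [beta[k] - step * grad[k] for k in cols]
            # An infinite weight gives an infinite threshold, which zeroes the block.
            for k, v in zip(cols, group_soft_threshold(z, step * w)):
                beta[k] = v
    return beta

def fit_compound_bridge(X, y, genes, lam, gamma=0.5, max_outer=100, tol=1e-3):
    """genes[j] lists, for each subtype measuring gene j, that block's column indices in X."""
    p = len(X[0])
    # tau_n from lambda_n = tau_n^(1-gamma) * gamma^(-gamma) * (1-gamma)^(gamma-1)
    tau = (lam * gamma ** gamma * (1.0 - gamma) ** (1.0 - gamma)) ** (1.0 / (1.0 - gamma))
    beta = [1.0] * p                                   # Step 1: all-ones initial value
    blocks = [cols for gene in genes for cols in gene]
    for _ in range(max_outer):
        weights = []
        for gene in genes:
            c_j = len(gene) ** (1.0 - gamma)
            b_j = sum(math.sqrt(len(cols)) * math.sqrt(sum(beta[k] ** 2 for k in cols))
                      for cols in gene)
            theta = c_j * ((1.0 - gamma) / (gamma * tau)) ** gamma * b_j ** gamma  # update (8)
            kappa = (math.inf if theta == 0.0
                     else theta ** (1.0 - 1.0 / gamma) * c_j ** (1.0 / gamma))
            weights.extend(kappa * math.sqrt(len(cols)) for cols in gene)
        new_beta = weighted_group_lasso(X, y, blocks, weights, beta)               # update (9)
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_beta, beta)))
        beta = new_beta
        if change < tol:                               # L2 change below 0.001
            break
    return beta
```

In this sketch, `X` and `y` are the stacked, weighted, and normalized design matrix and response. Once every block of a gene reaches zero, its $\theta_j$ is zero and the gene stays excluded in later iterations, which mirrors the gene-level selection of the bridge penalty.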

3.3 Tuning parameter selection 

The proposed penalty involves two tuning parameters, $\gamma$ and $\lambda_n$. In the study of bridge type
penalties (Huang et al. 2009), the value of $\gamma$ is usually fixed. Theoretically speaking, different
values of $\gamma$, as long as in the interval (0, 1), lead to similar asymptotic results. In practice, as
$\gamma \to 1$, the bridge type penalty goes to the Lasso type penalty; on the other hand, as $\gamma \to 0$, it
behaves similarly to AIC/BIC type penalties. In our numerical study, we experiment with a few
$\gamma$ values, particularly including 0.5, 0.7 and 0.9. The effect of $\lambda_n$ is similar to that with other
penalties. As $\lambda_n \to \infty$, fewer genes/SNPs are identified.

As the function $L(\beta)$ has a least squares form, we propose using BIC for tuning parameter
selection. Particularly, with a fixed $\gamma$, the optimal $\lambda_n$ minimizes

$$\mathrm{BIC}(\lambda_n) = \log\left\{ \|Y - X\hat{\beta}(\lambda_n)\|^2 / n \right\} + \log(n)\, \mathrm{df}(\lambda_n) / n.$$

Here we use the notation $\hat{\beta}(\lambda_n)$ to emphasize the dependence of $\hat{\beta}$ on $\lambda_n$. Motivated by Yuan and
Lin (2006), an approximation of the degrees of freedom is adopted as

$$\mathrm{df}(\lambda_n) = \sum_{j=1}^{J} \sum_{k=1}^{M_j} I\big(\|\hat{\beta}_{jk}\| > 0\big) + \sum_{j=1}^{J} \sum_{k=1}^{M_j} \frac{\|\hat{\beta}_{jk}\|}{\|\hat{\beta}_{jk}^{LS}\|} (d_{jk} - 1). \tag{10}$$

Here $\hat{\beta}_{jk}^{LS}$ is obtained by fitting an AFT model (with least squares estimation) using the $j$th gene
and $k$th subtype only.
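The two quantities are simple to compute once the penalized and single-block least squares estimates are available. The sketch below assumes the Yuan and Lin (2006) style approximation, with one term per gene-by-subtype block; the function names and the flat list-of-blocks layout are illustrative choices.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def approx_df(beta_blocks, ls_blocks):
    """Approximate degrees of freedom.

    beta_blocks[b] is the penalized estimate for one (gene, subtype) block and
    ls_blocks[b] is the least squares estimate fitted on that block alone.
    """
    df = 0.0
    for b, b_ls in zip(beta_blocks, ls_blocks):
        norm = l2_norm(b)
        if norm > 0.0:
            # 1 for a selected block, plus a shrinkage-scaled share of its d_jk - 1 extra terms
            df += 1.0 + norm / l2_norm(b_ls) * (len(b) - 1)
    return df

def bic(rss, n, df):
    """BIC for a least squares type fit with residual sum of squares rss."""
    return math.log(rss / n) + math.log(n) * df / n

df = approx_df([[0.0, 0.0], [3.0, 4.0]], [[1.0, 1.0], [6.0, 8.0]])
score = bic(rss=10.0, n=100, df=df)
```

In practice the score would be evaluated over a grid of $\lambda_n$ values for each fixed $\gamma$, keeping the value with the smallest BIC.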

3.4 Practical considerations 

With practical data, minor allele frequencies in some loci can be low. This may cause an instability 
problem in the Cholesky decomposition when some eigenvalues of the correlation matrices are 
too small. In the proposed penalized selection, within-gene-SNP level selection is not of interest. 
To reduce the dimensionality within genes and to tackle the collinearity problem, when there is
evidence of a lack of stability, we first conduct principal component analysis (PCA) within genes. 
Specifically, we choose the number of PCs such that at least 90% of the total variation is explained. 
Then the PCs, as opposed to the original SNP measurements, are used in downstream analysis. 
Our empirical study suggests that this simple step may ensure that the smallest eigenvalues of the 
covariance matrices are not too small and that the Cholesky decomposition is stable. 

4 Simulation Study 

Three datasets (subtypes) are simulated, each with 100 subjects. For each subject, the genotypes 
of 200 genes are simulated, each with 5 SNPs. There are thus a total of 1000 SNPs for each 
subtype. The genotypes are first generated from multivariate normal distributions. Then the value 
of each SNP is set equal to 0, 1, or 2, depending on whether the continuous (normally distributed) 
value is $< -c$, $\in [-c, c]$, or $> c$, where $c$ is the 3rd quartile of the standard normal distribution.
Thus on average, each SNP has equal allele frequencies. Genotypes $j$ and $k$, if from different genes,
have correlation coefficient $0.2^{|j-k|}$. For genotypes from the same genes, consider the following two
correlation structures. The first is the auto-regressive correlation, where genotypes $j$ and $k$ have
correlation coefficient $\rho^{|j-k|}$, with $\rho = 0.2$, 0.5, and 0.8 corresponding to weak, moderate, and strong
correlations, respectively. The second is the banded correlation structure. Here three scenarios are
considered. Under the first scenario, genotypes $j$ and $k$ have correlation coefficient 0.2 if $|j - k| = 1$,
0.1 if $|j - k| = 2$, and 0 otherwise. Under the second scenario, genotypes $j$ and $k$ have correlation
coefficient 0.5 if $|j - k| = 1$, 0.25 if $|j - k| = 2$, and 0 otherwise. Under the third scenario, genotypes
$j$ and $k$ have correlation coefficient 0.6 if $|j - k| = 1$, 0.33 if $|j - k| = 2$, and 0 otherwise.
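The thresholding step can be illustrated for a single gene under the auto-regressive structure. This sketch generates the latent normal values with an AR(1) recursion, which gives correlation $\rho^{|j-k|}$ between latent values $j$ and $k$, and then cuts them at $\pm c$. It covers only the within-gene correlation; the between-gene correlation and the banded structures are not reproduced, and the function name `simulate_gene` is hypothetical.

```python
import math
import random
from statistics import NormalDist

def simulate_gene(n_snps, rho, rng):
    """Genotypes (0, 1, or 2) for one gene from AR(1)-correlated latent normals."""
    c = NormalDist().inv_cdf(0.75)          # 3rd quartile of the standard normal
    genotypes = []
    z = rng.gauss(0.0, 1.0)
    for k in range(n_snps):
        if k > 0:
            # AR(1) update keeps unit variance and gives correlation rho^|j-k|
            z = rho * z + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        genotypes.append(0 if z < -c else 2 if z > c else 1)
    return genotypes

rng = random.Random(2024)
# One subject: 200 genes with 5 SNPs each, moderate correlation.
subject = [g for _ in range(200) for g in simulate_gene(5, 0.5, rng)]
```

Because $c$ is the third quartile, the three genotype values occur with probabilities 0.25, 0.5, and 0.25, which corresponds to equal allele frequencies.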

Consider two cases of the nonzero regression coefficients. In case 1, the nonzero regression 
coefficients for subtype 1 and 2 are (0.15, 0.15, 0.15, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1, 0.1, 0.15, 0.15, 
0.15, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1, 0.1), and the nonzero coefficients for subtype 3 are (0.1, 0.1, 0.1, 
0.1, 0.1, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1, 0.1). In case 2, 
the nonzero regression coefficients for subtype 1 and 2 are (0.15, 0.15, 0.15, 0.15, 0, 0, 0.1, 0.1, 0.1, 
0.1, 0.15, 0.15, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1), and the nonzero coefficients for subtype 3 are 
(0.1, 0.1, 0.1, 0.1, 0.1, 0.15, 0.15, 0.15, 0.15, 0, 0.15, 0.15, 0.15, 0.15, 0, 0, 0.1, 0.1, 0.1, 0.1). Thus, 
across the three subtypes, there are 60 SNPs associated with prognosis, representing 12 genes. 

Two scenarios under the heterogeneity model are considered. Under the first scenario, all 
three subtypes share three common susceptibility genes, and each subtype has one subtype-specific 
susceptibility gene. The "unmatching rate" of susceptibility genes is thus 25%. Under the second 
scenario, the three subtypes share two common susceptibility genes, and each subtype has two 
subtype-specific susceptibility genes. The unmatching rate of susceptibility genes is 50%. As a 
special case of the heterogeneity model, the homogeneity model is also considered, under which all 
three subtypes have the same susceptibility genes. 

The logarithms of event times are generated from the AFT models with intercept equal to 
0.5 and normally distributed random errors. The logarithms of censoring times are generated as 
uniformly distributed and independent of the event times. The censoring distribution parameters 
are adjusted so that overall censoring rate is about 30%. 
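A minimal sketch of the data-generating step follows. The intercept of 0.5 and the normal errors follow the description above, while the standard normal error scale and the uniform bounds `cens_low` and `cens_high` are assumed placeholders that would need to be tuned to reach a censoring rate of about 30%; the function name `simulate_aft` is illustrative.

```python
import random

def simulate_aft(X, beta, rng, intercept=0.5, cens_low=0.0, cens_high=4.0):
    """Return observed log times y and event indicators delta under an AFT model."""
    y, delta = [], []
    for x in X:
        # log event time: intercept + linear predictor + standard normal error
        log_t = intercept + sum(a * b for a, b in zip(x, beta)) + rng.gauss(0.0, 1.0)
        # log censoring time: uniform and independent of the event time
        log_c = rng.uniform(cens_low, cens_high)
        y.append(min(log_t, log_c))
        delta.append(1 if log_t <= log_c else 0)
    return y, delta

rng = random.Random(0)
X = [[1, 0, 2], [0, 1, 1], [2, 2, 0]]
y, delta = simulate_aft(X, [0.15, 0.1, 0.0], rng)
censoring_rate = 1.0 - sum(delta) / len(delta)
```

One way to calibrate the bounds is to simulate a large sample, compute `censoring_rate`, and adjust `cens_high` until the rate is close to the target.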

Beyond the proposed approach, simulated data are also analyzed using a meta-analysis ap- 
proach. Here each subtype is analyzed using the group Lasso (GLasso) approach, where a "group" 
corresponds to one gene with multiple SNPs. Then the identified gene lists are combined across 
subtypes. With both approaches, the tuning parameters are chosen using BIC. With the proposed
approach, we experiment with $\gamma = 0.5$, 0.7, and 0.9. Summary statistics on gene identification
accuracy based on 100 replicates are shown in Tables 2-3 and 5-8 (Appendix).

Simulation suggests that performance of the proposed approach depends on the correlation
structure, values of nonzero regression coefficients, and $\gamma$ value. As correlation gets stronger, in
general, more true positives and more false positives are identified. We fail to observe a clear pattern
for the dependence (of the performance of the proposed approach) on nonzero regression coefficients.
As $\gamma$ gets larger, more true positives and more false positives are also identified. This observation
is reasonable, considering that when $\gamma \to 1$, bridge penalization becomes close to Lasso, and that
Lasso-type penalization tends to over-select. Under almost all simulated scenarios, the proposed
approach identifies more true positives than GLasso. For example in Table 2, under the AR
correlation with $\rho = 0.5$, GLasso identifies 6.7 true positives, whereas the proposed approach
identifies 10.7, 10.8, and 11.0 true positives under different $\gamma$ values. Under all simulated scenarios,
the proposed approach identifies far fewer false positives. For example in Table 2, under the
banded correlation scenario 2, GLasso identifies 38 false positives, whereas the proposed approach
identifies 4, 4.3, and 9.8 false positives under different $\gamma$ values. Observations under the homogeneity
model (Tables 7 and 8) are similar. We have experimented with a few other settings and reached
similar conclusions.

5 Analysis of NHL Genetic Association Data 

NHL is the fifth leading cause of cancer incidence and mortality in the US and remains poorly under- 
stood and largely incurable. A genetic association study was conducted, searching for SNPs/genes 
associated with overall survival in NHL patients (Zhang et al. 2005). The prognostic cohort consists 
of 575 NHL patients, among whom 496 donated either blood or buccal cell samples. All cases were 
classified into NHL subtypes according to the World Health Organization classification system. 
Specifically, 155 had DLBCL, 117 had FL, 57 had CLL/SLL, 34 had MZBL, 37 had T/NK-cell
lymphoma, and 96 had other subtypes. Because of sample size consideration, we focus on DLBCL, 
FL, and CLL/SLL, the three largest subtypes in this dataset. The study cohort was assembled 
in Connecticut between 1996 and 2000. Vital status of all subjects was abstracted from the CTR 
(Connecticut Tumor Registry) in 2008. 

When genotyping, we took a candidate gene approach. Specifically, a total of 1462 tag SNPs 
from 210 candidate genes related to immune response were genotyped using a custom-designed 
GoldenGate assay. In addition, 302 SNPs in 143 candidate genes previously genotyped by Taqman 
assay were also included. There were a total of 1764 SNPs, representing 333 genes. Data preprocessing
is conducted. Subjects with more than 20% of SNPs missing are removed from analysis;
then SNPs with more than 20% missing are removed. The genotyping data were missing for the
following reasons: the amount of DNA was too low, samples failed to amplify, samples amplified 
but their genotype could not be determined due to ambiguous results, or the DNA quality was 
poor. The remaining missing SNP measurements are then imputed. A total of 1,633 SNPs pass 
processing, representing 238 genes. 

For DLBCL, 139 patients pass processing. Among them, 61 died, with survival times ranging 
from 0.47 to 10.46 years (mean 4.16 years). For the 78 censored patients, the follow up times range 
from 5.58 to 11.45 years (mean 9.08 years). For FL, 102 patients pass processing. Among them, 
33 died, with survival times ranging from 0.91 to 10.23 years. For the 69 censored patients, the 
follow up times range from 4.96 to 11.39 years, with mean 8.83 years. For CLL/SLL, 50 patients 
pass processing. Among them, 27 died, with survival times ranging from 1.91 to 10.13 years (mean 
4.85 years). For the 23 censored patients, the follow up times range from 4.92 to 11.07 years, with 
mean 8.83 years. 

Analysis results using the proposed approach are shown in Table 4 and 9-11 (Appendix). In 
particular, Table 4 contains the Li norms of the identified genes, whereas Table 9-11 contain the 
estimated regression coefficients for SNPs. Fourteen genes are identified as associated with the
overall survival of DLBCL; twelve genes are identified as associated with FL; and five genes are
identified as associated with CLL/SLL. Among the identified genes, MBP and STAT4 are shared 
by all three subtypes, ALOX5, IL10, IRAK2, LMAN1, MIF, and NCF4 are shared by two subtypes, 
and thirteen other genes are identified as subtype-specific. 

Among genes shared by multiple subtypes, gene ALOX5 encodes a member of the lipoxygenase 
gene family and plays a dual role in the synthesis of leukotrienes from arachidonic acid. Mutations 
in the promoter region of this gene lead to a diminished response to antileukotriene drugs used 
in the treatment of asthma and are associated with atherosclerosis and several cancers. Studies 
that have identified this gene as a marker of NHL include Mahshid et al. (2009), Feltenmark 
et al. (1995) and others. The protein encoded by gene IL10 is a cytokine produced primarily 
by monocytes and to a lesser extent by lymphocytes. This cytokine has pleiotropic effects in 
immunoregulation and inflammation. It down-regulates the expression of Th1 cytokines, MHC class
II Ags, and costimulatory molecules on macrophages. It also enhances B cell survival, proliferation, 
and antibody production. This cytokine can block NF-kappa B activity, and is involved in the 
regulation of the JAK-STAT signaling pathway. The involvement of IL10 in NHL has been proposed 
in Bi et al. (2012) and Deng et al. (2013). IRAK2 encodes the interleukin-1 receptor-associated 
kinase 2, one of two putative serine/threonine kinases that become associated with the interleukin-1 
receptor (IL1R) upon stimulation. It is identified as associated with NHL in Ngo et al. (2011). The 
protein encoded by gene LMAN1 is a type I integral membrane protein localized in the intermediate 
region between the endoplasmic reticulum and the Golgi. The protein is a mannose-specific lectin 
and a member of a novel family of plant lectin homologs in the secretory pathway of animal 
cells. Mutations in the gene are associated with a coagulation defect. It has been identified as 
a marker for gastric, colorectal, and prostate cancer. Myelin basic protein (MBP) is a protein 
important in the process of myelination of nerves in the central nervous system (CNS). MBP plays 
an important role in demyelinating diseases such as multiple sclerosis. Its involvement in NHL has
been discussed in Hu et al. (2012) and Han et al. (2010). Gene MIF encodes a lymphokine involved 
in cell-mediated immunity, immunoregulation, and inflammation. It plays a role in the regulation 
of macrophage function in host defense through the suppression of anti-inflammatory effects of 
glucocorticoids. This lymphokine and the JAB1 protein form a complex in the cytosol near the 
peripheral plasma membrane, which may indicate an additional role in integrin signaling pathways. 
Studies such as Xue et al. (2010) and Talos et al. (2005) have identified it as a marker of NHL. The 
protein encoded by gene NCF4 is a cytosolic regulatory component of the superoxide-producing 
phagocyte NADPH-oxidase, a multicomponent enzyme system important for host defense. It is 
identified as an NHL susceptibility gene in Kim et al. (2012). The protein encoded by gene STAT4 
is a member of the STAT family of transcription factors. In response to cytokines and growth 
factors, STAT family members are phosphorylated by the receptor associated kinases, and then 
form homo- or heterodimers that translocate to the cell nucleus where they act as transcription 
activators. This protein is essential for mediating responses to IL12 in lymphocytes, and regulating 
the differentiation of T helper cells. Involvement of this gene in NHL risk and progression has been 
discussed in Chang et al. (2010) and Chang et al. (2009). 

The relative stability of identified genes is evaluated using a random sampling approach (Huang 
and Ma 2010). In particular, we randomly sample 3/4 of the subjects and apply the proposed 
approach to identify prognosis-associated genes. This process is repeated 100 times. For each gene, 
we compute the probability of it being identified out of the 100 samplings. This probability is 
referred to as the observed occurrence index in Huang and Ma (2010) and measures the relative 
stability. Table 4 shows that only gene IL10 for DLBCL has a low occurrence index (0.21). All 
other observed occurrence indexes are high, suggesting relatively satisfactory stability. Prediction 
performance is also evaluated using a random sampling approach. In particular, genes are identified 
and models are constructed using 3/4 of randomly sampled subjects. Then prediction is made for 
the remaining 1/4 of the subjects. Based on the predicted $X^{m\prime}\hat{\beta}^m$, subjects are separated into two risk groups. 
The logrank statistic is computed to compare the survival risk of the two groups. This process is 
repeated 100 times, and the mean logrank statistic is computed as 7.1 (p-value 0.0077), suggesting 
satisfactory prediction. 
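
The random sampling procedure behind the observed occurrence index can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `observed_occurrence_index`, the callback `select_genes`, and the toy selector are hypothetical stand-ins for whatever selection procedure (for example, the proposed penalized estimate) is applied to each subsample.

```python
import random
from collections import Counter


def observed_occurrence_index(subjects, select_genes, n_rep=100, frac=0.75, seed=0):
    """Fraction of random subsamples in which each gene is selected.

    subjects     : list of subject identifiers
    select_genes : hypothetical callback mapping a subsample to a set of gene names
    n_rep, frac  : number of subsamples and subsampling fraction (100 and 3/4 in the text)
    """
    rng = random.Random(seed)
    size = int(round(frac * len(subjects)))
    counts = Counter()
    for _ in range(n_rep):
        subsample = rng.sample(subjects, size)   # sample without replacement
        counts.update(set(select_genes(subsample)))
    return {gene: c / n_rep for gene, c in counts.items()}


def toy_selector(subsample):
    """Toy stand-in for a selection method: GENE_A is always selected,
    GENE_B only when subject 0 is in the subsample."""
    genes = {"GENE_A"}
    if 0 in subsample:
        genes.add("GENE_B")
    return genes


ooi = observed_occurrence_index(list(range(40)), toy_selector)
print(ooi)
```

A gene selected in every subsample has index 1, while a gene whose selection hinges on a few subjects has a lower index, which is the sense in which the index measures relative stability.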

For comparison, we also analyze each subtype separately using GLasso (results shown in 
Table 12, Appendix). Twenty-six genes are identified as associated with the prognosis of DLBCL, 
seventeen genes are identified as associated with FL, and eight genes are identified as associated 
with CLL/SLL. Among those genes, three are shared by two subtypes. The identified genes are 
different from those using the proposed approach. Computation of the observed occurrence index 
shows that the identified genes also have satisfactory stability. Prediction evaluation generates a 
logrank statistic of 0.2 (p-value 0.65), which is considerably smaller than that using the proposed 
approach. 
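
The logrank comparison of two risk groups can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the standard two-group logrank statistic with a hypergeometric variance and a chi-square reference distribution with one degree of freedom, and the function names and the tiny data set are hypothetical. With one degree of freedom, the reported statistics 7.1 and 0.2 correspond to p-values of about 0.0077 and 0.65, matching the values quoted in the text.

```python
import math


def logrank_statistic(times, events, groups):
    """Two-group logrank chi-square statistic.

    times  : observed (possibly censored) times
    events : 1 if the event was observed, 0 if censored
    groups : 1 for the high-risk group, 0 for the low-risk group
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    o_minus_e, variance = 0.0, 0.0
    for t in event_times:
        n = sum(1 for s in times if s >= t)                                  # at risk
        n1 = sum(1 for s, g in zip(times, groups) if s >= t and g == 1)      # at risk, group 1
        d = sum(1 for s, e in zip(times, events) if s == t and e == 1)       # events at t
        d1 = sum(1 for s, e, g in zip(times, events, groups)
                 if s == t and e == 1 and g == 1)                            # events, group 1
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / variance if variance > 0 else 0.0


def chi2_1df_pvalue(stat):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


stat = logrank_statistic([1, 2, 3, 4], [1, 1, 1, 1], [1, 0, 1, 0])
print(round(stat, 4), round(chi2_1df_pvalue(7.1), 4), round(chi2_1df_pvalue(0.2), 2))
```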

6 Discussion 

In this study, with prognosis data on multiple subtypes of the same cancer, we develop a 
penalization approach which can conduct integrative analysis, identify important genes that contain 
SNPs associated with multiple subtypes, and allow for subtype-specific susceptibility genes. The 
proposed approach can be realized using an effective iterative algorithm. Under mild conditions, it 
has the desired consistency properties. Simulation shows that the proposed approach outperforms 
penalization-based meta-analysis, with more true positives and fewer false positives. In the 
analysis of NHL prognosis data, it identifies multiple genes shared by two or three subtypes as well 
as subtype-specific genes. The shared genes have important biological implications. The proposed 
approach also leads to significantly better prediction performance. 

To avoid confusion, in our description we focus on the scenario with multiple subtypes of the 
same cancer and the "SNP-within-gene" structure. The proposed approach is directly applicable to 
the analysis of multiple types of cancers and to "gene-within-cluster (pathway)" and other structures. 
In addition, with minor modifications, analysis of prognosis data under other models and analysis 
of diagnosis data can be conducted. The proposed penalty is built on bridge-type penalties. We 
conjecture that it is possible to build on other penalties such as MCP. Our limited investigation 
shows that under the present setup, the proposed penalty may have the lowest computational cost. 
In data analysis, our preliminary search shows that the common genes shared by multiple subtypes 
have important implications. However, because of the following limitations, the analysis results 
should be interpreted with caution. First, the sample size is still limited. Second, the NHL study 
takes a candidate gene approach. It is possible that important genes have been missed in the 
profiling stage. Third, the proposed evaluation is cross-validation based. Although it can compare 
different approaches on the same ground, it does not use completely independent data. More 
independent studies are needed to fully comprehend the data analysis results. 

Table 1: Matrix of regression coefficients for a cancer study with three subtypes, four genes and 
eight SNPs. An empty cell corresponds to a zero regression coefficient. 

                    Subtype
Gene   SNP      S1       S2       S3
1      1_1     0.20     0.19     0.21
       1_2    -0.22    -0.19    -0.21
2      2_1     0.18     0.21
       2_2    -0.21    -0.21
3      3_1                       0.21
       3_2                      -0.18
4      4_1
       4_2


Table 2: Simulation under the heterogeneity model: unmatching rate=25% and nonzero regression 
coefficients under case 1. In each cell, the first row is number of true positives (standard deviation), 
and the second row is model size (standard deviation). 

Correlation   GLasso        Proposed
                            γ = 0.5      γ = 0.7      γ = 0.9
AR ρ=0.2      5.7(2.4)      6.7(5.2)     7.4(4.6)     8.9(2.8)
              36.4(16.5)    9.5(7.6)     10.7(6.6)    20.1(6.3)
AR ρ=0.5      6.7(2.4)      10.7(3.0)    10.8(2.8)    11.0(1.9)
              39.7(19.3)    14.4(4.5)    14.7(4.0)    17.1(4.5)
AR ρ=0.8      9.4(2.3)      12.0(0.2)    12.0(0.2)    11.9(0.2)
              50.2(19.3)    14.4(1.9)    14.6(1.9)    15.6(2.7)
Banded 1      5.3(2.6)      7.8(5.0)     8.6(4.4)     8.9(3.6)
              33.8(16.2)    11.0(7.1)    11.9(5.9)    20.5(7.0)
Banded 2      7.5(2.8)      10.6(3.3)    11.2(2.2)    11.3(1.7)
              45.5(21.7)    14.6(4.8)    15.5(3.7)    21.1(7.0)
Banded 3      7.9(2.4)      11.5(1.9)    11.5(1.5)    11.6(1.1)
              44.2(17.7)    15.4(3.0)    15.4(2.7)    18.8(5.2)
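
The two summaries reported in each cell of the simulation tables, the number of true positives and the model size, each with a standard deviation across replicates, can be computed as in the following sketch. The function names and the toy replicates are hypothetical; the indices simply label genes.

```python
from statistics import mean, stdev


def selection_summary(selected, truth):
    """Return (number of true positives, model size) for one replicate."""
    selected, truth = set(selected), set(truth)
    return len(selected & truth), len(selected)


def summarize_replicates(selected_sets, truth):
    """Mean and standard deviation of true positives and model size over replicates."""
    pairs = [selection_summary(s, truth) for s in selected_sets]
    tp = [p[0] for p in pairs]
    size = [p[1] for p in pairs]
    return (mean(tp), stdev(tp)), (mean(size), stdev(size))


# Toy example: genes 1-4 are truly associated; three simulated replicates.
truth = {1, 2, 3, 4}
replicates = [{1, 2, 3, 9}, {1, 2, 3, 4, 7, 8}, {2, 3}]
(tp_mean, tp_sd), (size_mean, size_sd) = summarize_replicates(replicates, truth)
print(f"{tp_mean:.1f}({tp_sd:.1f})  {size_mean:.1f}({size_sd:.1f})")
```

A method with many true positives but a model size far above the number of true associations is selecting many false positives, which is the pattern to look for when comparing the columns.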

Table 3: Simulation under the heterogeneity model: unmatching rate=50% and nonzero regression 
coefficients under case 1. In each cell, the first row is number of true positives (standard deviation), 
and the second row is model size (standard deviation). 

Correlation   GLasso        Proposed
                            γ = 0.5      γ = 0.7      γ = 0.9
AR ρ=0.2      4.6(2.3)      3.7(4.1)     5.8(3.8)     8.3(2.1)
              30.7(16.9)    5.4(7.0)     10.3(6.9)    22.2(6.8)
AR ρ=0.5      6.5(3.0)      9.6(3.5)     9.7(3.1)     10.1(2.3)
              36.1(20.1)    16.8(7.5)    17.2(6.9)    21.1(6.8)
AR ρ=0.8      9.7(2.2)      11.5(0.9)    11.5(0.7)    11.6(0.7)
              54.5(17.5)    19.4(3.9)    18.4(4.1)    20.2(5.8)
Banded 1      4.7(2.4)      4.4(4.1)     5.9(3.8)     7.7(2.9)
              34.5(17.6)    6.7(7.2)     10.8(7.7)    20.8(9.2)
Banded 2      7.5(2.5)      8.4(3.9)     9.1(3.1)     10.0(1.7)
              47.1(19.4)    14.4(7.7)    15.9(6.1)    23.5(7.1)
Banded 3      7.3(2.4)      9.5(2.9)     9.9(2.6)     10.1(2.2)
              43.2(19.6)    16.6(6.7)    17.7(5.9)    22.9(7.6)

Table 4: Analysis of the NHL data using the proposed approach: L2-norm of estimate for a specific 
gene; OOI: observed occurrence index. 

Gene        DLBCL             FL                CLL/SLL
            L2-norm   OOI     L2-norm   OOI     L2-norm   OOI
ALOX12                        0.02      0.83
ALOX15B     0.01      0.76
ALOX5       0.02      0.71                      0.01      0.64
CLCA1                         0.02      0.83
CSF2        0.02      0.86
DEFB1       0.03      0.97
IL10        1.E-04    0.21                      0.02      0.62
IL17C                         0.02      0.78
IRAK2                         0.02      0.88    0.01      0.80
LIG4                          0.01      0.87
LMAN1       0.02      0.71    0.02      0.77
MBP         0.01      0.68    4.E-03    0.60    0.04      0.68
MCP         0.01      0.83
MEFV        0.02      0.85
MIF         0.02      0.83    1.E-03    0.53
MUC6        0.03      0.99
NCF4        0.01      0.64    0.01      0.64
PTK9L                         0.01      0.66
SERPINB3    0.01      0.55
SOD3                          4.E-03    0.47
STAT4       0.02      0.95    0.01      0.88    0.01      0.91
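
The gene-level L2-norm reported in Table 4 is the Euclidean norm of the SNP-level estimates belonging to that gene within a subtype. A minimal sketch, using the DLBCL SNP-level estimates for IL10 and CSF2 as listed in Table 9 (the helper name `l2_norm` is illustrative):

```python
import math


def l2_norm(estimates):
    """Euclidean norm of the SNP-level estimates of one gene in one subtype."""
    return math.sqrt(sum(v * v for v in estimates))


# SNP-level estimates for DLBCL, copied from Table 9.
il10_dlbcl = [-0.00009, -0.00009, 0.00000, 0.00001, 0.00001, 0.00000, 0.00000]
csf2_dlbcl = [0.01655]

print(f"IL10: {l2_norm(il10_dlbcl):.0e}, CSF2: {l2_norm(csf2_dlbcl):.2f}")
```

The very small norm for IL10 in DLBCL is in line with its low observed occurrence index.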

7 Appendix 

7.1 Statistical properties and proofs 

In this section, we use the same notation as in Sections 2 and 3.1. The PCA step described in 
Section 3.4 is optional and mainly for practical consideration. If PCA is actually conducted, the 
$d_{jk}$ values may be smaller. 

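Proof of Proposition 3.1. We have $\min_{\beta,\theta} S(\beta,\theta) = \min_{\beta} \tilde{S}(\beta)$, 
where $\tilde{S}(\beta) = \min_{\theta}\{S(\beta,\theta) : \theta \ge 0\}$. For any $\beta$, 

$$\hat{\theta}(\beta) = \arg\min\{S(\beta,\theta) : \theta \ge 0\} = \arg\min_{\theta \ge 0}\left\{ \sum_{j=1}^{J} \theta_j^{1-1/\gamma} c_j^{1/\gamma} \sum_{k=1}^{M_j} \sqrt{d_{jk}}\,\|\beta_{jk}\| + \tau_n \sum_{j=1}^{J} \theta_j \right\}.$$

Therefore, $\hat{\theta}(\beta) = (\hat{\theta}_1(\beta), \ldots, \hat{\theta}_J(\beta))'$ satisfies 

$$(1/\gamma - 1)\,\hat{\theta}_j^{-1/\gamma}(\beta)\, c_j^{1/\gamma} \sum_{k=1}^{M_j} \sqrt{d_{jk}}\,\|\beta_{jk}\| = \tau_n, \quad (j = 1,\ldots,J).$$

Write $\tilde{S}(\beta) = S(\beta,\hat{\theta}(\beta))$, and substitute the expressions 

$$\hat{\theta}_j(\beta) = c_j \left(\frac{1-\gamma}{\tau_n\gamma}\right)^{\gamma} \left(\sum_{k=1}^{M_j} \sqrt{d_{jk}}\,\|\beta_{jk}\|\right)^{\gamma}, \qquad \hat{\theta}_j^{1-1/\gamma}(\beta) = \left(\frac{1-\gamma}{\tau_n\gamma}\right)^{\gamma-1} c_j^{1-1/\gamma} \left(\sum_{k=1}^{M_j} \sqrt{d_{jk}}\,\|\beta_{jk}\|\right)^{\gamma-1}$$

into $S(\beta,\hat{\theta}(\beta))$. After some algebra, we obtain 

$$\tilde{S}(\beta) = \frac{1}{2n}\|Y - X\beta\|^2 + \lambda_n \sum_{j=1}^{J} c_j \left(\sum_{k=1}^{M_j} \sqrt{d_{jk}}\,\|\beta_{jk}\|\right)^{\gamma}.$$

Then if $\lambda_n = \tau_n^{1-\gamma}\gamma^{-\gamma}(1-\gamma)^{\gamma-1}$, $\tilde{S}(\beta)$ and the objective function defined in Section 3.1 are 
equivalent. □ 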
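
The equivalence in Proposition 3.1 can be checked numerically for a single gene. The sketch below is an illustration under stated assumptions, not part of the proof: `a` stands for $\sum_k \sqrt{d_{jk}}\|\beta_{jk}\|$, `c` for $c_j$, `tau` for $\tau_n$, the $\theta$-dependent part of the objective is taken to be $\theta^{1-1/\gamma} c^{1/\gamma} a + \tau\theta$, and the numerical values are arbitrary.

```python
def theta_hat(a, c, gamma, tau):
    """Closed-form minimizer over theta for one gene."""
    return c * ((1.0 - gamma) / (tau * gamma)) ** gamma * a ** gamma


def penalty_in_theta(theta, a, c, gamma, tau):
    """Theta-dependent part of the augmented objective for one gene."""
    return theta ** (1.0 - 1.0 / gamma) * c ** (1.0 / gamma) * a + tau * theta


def bridge_penalty(a, c, gamma, tau):
    """lambda_n * c * a**gamma with lambda_n = tau^(1-gamma) gamma^(-gamma) (1-gamma)^(gamma-1)."""
    lam = tau ** (1.0 - gamma) * gamma ** (-gamma) * (1.0 - gamma) ** (gamma - 1.0)
    return lam * c * a ** gamma


a, c, gamma, tau = 1.7, 2.0, 0.7, 0.3          # arbitrary illustrative values
t = theta_hat(a, c, gamma, tau)
profiled = penalty_in_theta(t, a, c, gamma, tau)
print(profiled, bridge_penalty(a, c, gamma, tau))
```

Profiling out $\theta$ reproduces the bridge-type penalty, and perturbing $\theta$ away from the closed-form value can only increase the augmented objective.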

Let $B_1$ and $B_2$ be the index sets of genes with nonzero and zero norms of regression coefficients, 
respectively. Let $\beta_0$ be the true parameter vector of $\beta$, $q_m$ be the size of the set $\{j : \|\beta_{0jm}\| \neq 0\}$, 
and $q = \sum_{m=1}^{M} q_m$ (that is, the total number of associations between genes and prognosis across 
all subtypes). Without loss of generality, suppose that $\|\beta_{0jk}\| \neq 0$ for all $j = 1, \ldots, J_1$ and some 
$k = 1, \ldots, M_j$. With $X$ as the full design matrix (defined in Section 3.2), denote $X_{B_1}$ as the 
design matrix corresponding to $B_1$. Define $\Sigma_n = n^{-1}X'X$, $\Sigma_{1n} = n^{-1}X_{B_1}'X_{B_1}$. Further for 
any index set $A$, let $X_A$ denote the corresponding design matrix, and $\Sigma_A = n^{-1}X_A'X_A$. Let 
$A_1 = \{(j,k) : \|\beta_{0jk}\| \neq 0\}$. 

We make the following assumptions: 

(A1) $M$ is finite, $q$ is finite. For subtype $m$ $(= 1, \ldots, M)$, $\{(Y_i^m, \delta_i^m, X_i^m), i = 1, \ldots, n_m\}$ are iid. 
The errors $\{\varepsilon_1^m, \ldots, \varepsilon_{n_m}^m\}$ are iid with mean zero and finite variance $(\sigma^m)^2$. Denote 
$\varepsilon^m = (\varepsilon_1^m, \ldots, \varepsilon_{n_m}^m)'$, and $\varepsilon = (\varepsilon^{1\prime}, \ldots, \varepsilon^{M\prime})'$. Denote $\sigma^2 = \max_m (\sigma^m)^2$. The 
$i$th component of $\varepsilon^m$, $\varepsilon_i^m$, is subgaussian, in the sense that there exist $K_1, K_2 > 0$ such that the tail 
probability satisfies $P(|\varepsilon_i^m| > u) \le K_2\exp(-K_1u^2)$ for all $u > 0$. 

(A2) For subtype $m$ $(= 1, \ldots, M)$, the errors $(\varepsilon_1^m, \ldots, \varepsilon_{n_m}^m)$ are independent of the Kaplan-Meier 
weights $(w_1^m, \ldots, w_{n_m}^m)$. The covariates are bounded. 

(A3) The design matrix satisfies the sparse Riesz condition (SRC) with rank $q^*$. That is, there exist 
constants $0 < c_* < c^* < \infty$, such that for $q^* = (3 + 4C)q$ and $C = c^*/c_*$, with probability 
converging to 1, $c_* \le \nu'\Sigma_A\nu/\|\nu\|^2 \le c^*$, $\forall A$ with $|A| = q^*$ and $\nu \in \mathbb{R}^{q^*}$. In addition, the SRC 
condition holds for the design matrix of each subtype separately with rank $q^{m*}$ for subtype $m$. 

(A4) $\lambda_n(\log d/n)^{\gamma/2-1} \to \infty$. Let $\eta_n = \max_j c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\beta_{0jk}\|\right)^{\gamma-1}$. 
$\sum_{j=1}^{J_1}\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\beta_{0jk}\| = O(1)$ and $\eta_n = O(1)$. 

The above assumptions are in parallel with those in Huang and Ma (2010). Further 
complications are introduced to accommodate the multi-dataset setting, heterogeneity across subtypes, 
and the "SNP-within-gene" structure. The main properties of the penalized estimate can be summarized 
as follows. 



Theorem 7.1. Suppose that (A1)-(A3) hold and $\lambda_n \ge O(1)\sqrt{\log(d)/n}$. Then 

1. With probability converging to 1, $|\hat{A}_1| \le (2 + 4C)q$. 

2. $\|\hat{\beta} - \beta_0\|^2 \le O(\lambda_n^2\eta_n^2) + O_p(\log d/n)$. In particular, if $\lambda_n = O(1)\sqrt{\log(d)/n}$, then 
$\|\hat{\beta} - \beta_0\|^2 = O_p(\log d/n)$. 

The above result establishes that the number of identified genes is a finite multiple of the true 
number of associated genes, which is assumed to be finite in (A1). In addition, if $\log(d)/n \to 0$, 
the estimate is $L_2$ estimation consistent. The selection and estimation results can be further 
strengthened as follows. 

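Theorem 7.2. Suppose that (A1)-(A4) hold. 

1. It holds that $\hat{\beta}_{B_2} = 0$ with probability converging to 1. 

2. Suppose that $\{B_1, \beta_{0B_1}\}$ are fixed unknown. In the AFT models, the subgaussian 
conditions are strengthened to normal distributions. $\Sigma^m_{1n_m} = (n_m)^{-1}(X^m_{B_1^m})'X^m_{B_1^m} \to \Sigma^m_1$ 
and $n^{-1/2}X_{B_1}'\varepsilon = \sum_{m=1}^{M} n^{-1/2}(X^m_{B_1^m})'\varepsilon^m \to Z \sim N\left(0, \sum_{m=1}^{M}(\sigma^m)^2\Sigma^m_1\right)$ in distribution. 
Then, in distribution, 

$$\sqrt{n}(\hat{\beta}_{B_1} - \beta_{0B_1}) \to \arg\min\{V_1(u) : u \in \mathbb{R}^{|B_1|}\},$$

where 

$$V_1(u) = -2u'Z + u'\Sigma_1u + \gamma\lambda_0\sum_{j=1}^{J_1} c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\beta_{0jk}\|\right)^{\gamma-1}\sum_{k=1}^{M_j}\sqrt{d_{jk}}\left\{\frac{u_{jk}'\beta_{0jk}}{\|\beta_{0jk}\|}I(\|\beta_{0jk}\| \neq 0) + \|u_{jk}\|I(\|\beta_{0jk}\| = 0)\right\},$$

with $u_{jk}$ corresponding to the component of gene $j$ and subtype $k$. 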

The above result establishes that under mild conditions, the proposed approach can consistently 
identify genes that are associated with at least one subtype. This result is consistent with that 
in Huang et al. (2009) for the analysis of a single dataset. In addition, when the random errors 
have normal distributions, the asymptotic distribution can be rigorously established. To prove the 
above theorems, we first establish the following lemma. 

Let $\zeta^m = (\zeta_1^m, \ldots, \zeta_{n_m}^m)'$, where $\zeta_i^m = w_i^m\varepsilon_i^m = w_i^m(Y_{(i)}^m - X_{(i)}^{m\prime}\beta_0^m)$. 

Lemma 7.3. Suppose that assumptions (A2) and (A3) hold. Let $\xi_j^m = X_j^{m\prime}\zeta^m$, $1 \le j \le d$, where 
$X_j^m$ is the $j$th column of $X^m$. Let $\xi^m = \max_{1 \le j \le d}|\xi_j^m|$. Then 

$$E(\xi^m) \le C_1\sqrt{\log(d)}\left(\sqrt{2C_2n_m\log(d)} + 4\log(2d) + C_2n_m\right)^{1/2},$$

where $C_1$ and $C_2$ are two positive constants. In particular, when $\log(d)/n_m \to 0$, 

$$E(\xi^m) = O(1)\sqrt{n_m\log(d)}.$$



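Proof of Lemma 7.3. Let $s^2_{n_mj} = X_j^{m\prime}X_j^m$. Conditional on the $X_j^m$'s, assumptions (A2) and (A3) imply 
that the $\xi_j^m$'s are subgaussian. Let $s^2_{n_m} = \max_{1 \le j \le d}s^2_{n_mj}$. By (A2) and the maximal inequality for 
subgaussian random variables (Van der Vaart and Wellner 1996, Lemmas 2.2.1 and 2.2.2), 

$$E\left(\max_{1 \le j \le d}|\xi_j^m| \;\Big|\; X_j^m, 1 \le j \le d\right) \le C_1s_{n_m}\sqrt{\log(d)},$$

for a constant $C_1 > 0$. Therefore, 

$$E\left(\max_{1 \le j \le d}|\xi_j^m|\right) \le C_1\sqrt{\log(d)}\,E(s_{n_m}). \quad (11)$$

Since 

$$\sum_{i=1}^{n_m}E\left\{(X_{ij}^m)^2 - E(X_{ij}^m)^2\right\}^2 \le 4C_2n_m, \quad (12)$$

and 

$$\max_{1 \le j \le d}\sum_{i=1}^{n_m}E(X_{ij}^m)^2 \le C_2n_m. \quad (13)$$

By Lemma 4.2 of Van de Geer (2008), (12) implies 

$$E\left(\max_{1 \le j \le d}\left|\sum_{i=1}^{n_m}\left\{(X_{ij}^m)^2 - E(X_{ij}^m)^2\right\}\right|\right) \le \sqrt{2C_2n_m\log(d)} + 4\log(2d). \quad (14)$$

Therefore, by (13) and the triangle inequality, 

$$E\,s^2_{n_m} \le \sqrt{2C_2n_m\log(d)} + 4\log(2d) + C_2n_m.$$

Now since $E\,s_{n_m} \le (E\,s^2_{n_m})^{1/2}$, we have 

$$E\,s_{n_m} \le \left(\sqrt{2C_2n_m\log(d)} + 4\log(2d) + C_2n_m\right)^{1/2}. \quad (15)$$

The lemma follows from (11) and (15). □ 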



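Proof of Theorem 7.1. Part (i) follows from the proof of Theorem 1 of Zhang and Huang (2008). 
One difference is that here a subgaussian assumption is made, which is weaker than the normality 
assumption in Zhang and Huang (2008). Since subgaussian random variables have the same tail 
behaviors as normal random variables, the argument of Zhang and Huang (2008) goes through. In 
addition, the SRC condition is imposed at a group level, as opposed to individual covariate level, 
to accommodate the "SNP-within-gene" structure. 

(ii) By the definition of $\hat{\beta}$, 

$$\frac{1}{2n}\|Y - X\hat{\beta}\|^2 + \lambda_n\sum_{j=1}^{J}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma} \le \frac{1}{2n}\|Y - X\beta_0\|^2 + \lambda_n\sum_{j=1}^{J}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\beta_{0jk}\|\right)^{\gamma}.$$

Thus 

$$\frac{1}{2n}\|Y - X\hat{\beta}\|^2 - \frac{1}{2n}\|Y - X\beta_0\|^2 \le \lambda_n\sum_{j=1}^{J}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\beta_{0jk}\|\right)^{\gamma} - \lambda_n\sum_{j=1}^{J}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma}.$$

Using $\zeta = Y - X\beta_0$, we have 

$$\frac{1}{2n}\|Y - X\hat{\beta}\|^2 - \frac{1}{2n}\|Y - X\beta_0\|^2 = \frac{1}{2n}\|X(\hat{\beta} - \beta_0)\|^2 - \frac{1}{n}\zeta'X(\hat{\beta} - \beta_0). \quad (16)$$

Let $B = A_1 \cup \hat{A}_1 = \{(j,k) : \|\beta_{0jk}\| \neq 0 \text{ or } \|\hat{\beta}_{jk}\| \neq 0\}$. Note that $|B| \le q^*$ with probability 
converging to 1 by part (i), where $q^*$ is defined in (A3). Denote $\eta_B = X_B(\hat{\beta}_B - \beta_{0B})$. 

Since $b^{\gamma} - a^{\gamma} \le 2(b - a)b^{\gamma-1}$ for $0 \le a \le b$, 

$$\begin{aligned}
&\lambda_n\sum_{j=1}^{J_1}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\beta_{0jk}\|\right)^{\gamma} - \lambda_n\sum_{j=1}^{J_1}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma} \\
&\quad \le 2\lambda_n\sum_{j=1}^{J_1}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\beta_{0jk}\|\right)^{\gamma-1}\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\Big|\|\beta_{0jk}\| - \|\hat{\beta}_{jk}\|\Big| \\
&\quad \le 2\lambda_n\eta_n\sum_{j=1}^{J_1}\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\Big|\|\beta_{0jk}\| - \|\hat{\beta}_{jk}\|\Big| \\
&\quad \le 2\lambda_n\eta_n\|\hat{\beta}_B - \beta_{0B}\|. \quad (17)
\end{aligned}$$

Following from (16) and (17), we have 

$$\frac{1}{2n}\|\eta_B\|^2 - \frac{1}{n}\zeta'\eta_B \le 2\lambda_n\eta_n\|\hat{\beta}_B - \beta_{0B}\|. \quad (18)$$

Let $\zeta_B$ be the projection of $\zeta$ to the span of $X_B$, i.e., $\zeta_B = X_B(X_B'X_B)^{-1}X_B'\zeta$. We have 

$$\zeta'\eta_B = \zeta'X_B(\hat{\beta}_B - \beta_{0B}) = \left\{(X_B'X_B)^{-1/2}X_B'\zeta\right\}'\left\{(X_B'X_B)^{1/2}(\hat{\beta}_B - \beta_{0B})\right\}.$$

Therefore, by the Cauchy-Schwarz inequality, 

$$|\zeta'\eta_B| \le \|\zeta_B\| \cdot \|\eta_B\| \le \|\zeta_B\|^2 + \frac{1}{4}\|\eta_B\|^2. \quad (19)$$

Combining (18) and (19), 

$$\frac{1}{2n}\|\eta_B\|^2 \le \frac{2}{n}\|\zeta_B\|^2 + 4\lambda_n\eta_n\|\hat{\beta}_B - \beta_{0B}\|. \quad (20)$$

By the SRC condition (A3), $n^{-1}\|\eta_B\|^2 \ge c_*\|\hat{\beta}_B - \beta_{0B}\|^2$. Thus (20) implies 

$$\frac{c_*}{2}\|\hat{\beta}_B - \beta_{0B}\|^2 \le \frac{2}{n}\|\zeta_B\|^2 + \frac{(4\lambda_n\eta_n)^2}{c_*} + \frac{c_*}{4}\|\hat{\beta}_B - \beta_{0B}\|^2.$$

It follows that 

$$\|\hat{\beta}_B - \beta_{0B}\|^2 \le \frac{8}{nc_*}\|\zeta_B\|^2 + \frac{64\lambda_n^2\eta_n^2}{c_*^2}. \quad (21)$$

Now 

$$\|\zeta_B\|^2 = \|(X_B'X_B)^{-1/2}X_B'\zeta\|^2 \le \frac{1}{nc_*}\|X_B'\zeta\|^2 \le \frac{1}{nc_*}\max_{A:|A| \le q^*}\|X_A'\zeta\|^2 \le \frac{M}{nc_*}\max_m\max_{A^m:|A^m| \le q^{m*}}\|X^{m\prime}_{A^m}\zeta^m\|^2.$$

We have 

$$\max_{A^m:|A^m| \le q^{m*}}\|X^{m\prime}_{A^m}\zeta^m\|^2 = \max_{A^m:|A^m| \le q^{m*}}\sum_{j \in A^m}|X_j^{m\prime}\zeta^m|^2 \le q^{m*}\max_{1 \le j \le d}|X_j^{m\prime}\zeta^m|^2.$$

By Lemma 7.3, 

$$\max_{1 \le j \le d}|X_j^{m\prime}\zeta^m|^2 = n_m\max_{1 \le j \le d}|(n_m)^{-1/2}X_j^{m\prime}\zeta^m|^2 = O_p(n_m\log d).$$

Therefore, 

$$\|\zeta_B\|^2 = O_p(q^*\log d). \quad (22)$$

The result follows from (21) and (22). □ 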
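
The proofs in this appendix rely on two elementary inequalities for a bridge exponent $0 < \gamma < 1$ and $0 \le a \le b$: $b^{\gamma} - a^{\gamma} \le 2(b-a)b^{\gamma-1}$ and $\gamma b^{\gamma-1}(b-a) \le b^{\gamma} - a^{\gamma}$. The following sketch checks both on a grid of values; it is a numerical sanity check, not a proof, and the function name and grid are illustrative.

```python
def check_inequalities(gamma, grid=50, tol=1e-12):
    """Check, on a grid with 0 <= a <= b, that
       b^g - a^g <= 2 (b - a) b^(g-1)   and   g b^(g-1) (b - a) <= b^g - a^g."""
    for i in range(1, grid + 1):
        b = i / 10.0
        for k in range(0, grid + 1):
            a = b * k / grid
            diff = b ** gamma - a ** gamma
            if diff > 2.0 * (b - a) * b ** (gamma - 1.0) + tol:
                return False
            if gamma * b ** (gamma - 1.0) * (b - a) > diff + tol:
                return False
    return True


results = {g: check_inequalities(g) for g in (0.5, 0.7, 0.9)}
print(results)
```

The second inequality is the tangent-line bound for the concave function $x \mapsto x^{\gamma}$ at $x = b$; the first follows by considering $a/b \le 1/2$ and $a/b > 1/2$ separately.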



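Proof of Theorem 7.2. (i) We prove the first part of Theorem 7.2. 
Define $\tilde{\beta}$ by 

$$\tilde{\beta}_j = \begin{cases}\hat{\beta}_j, & (j \in B_1), \\ 0, & (j \in B_2).\end{cases}$$

The Karush-Kuhn-Tucker condition for $\hat{\beta}$ implies that 

$$\frac{1}{n}(Y - X\hat{\beta})'X_{jl} = \gamma c_j\lambda_n\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma-1}\sqrt{d_{jl}}\,\frac{\hat{\beta}_{jl}'}{\|\hat{\beta}_{jl}\|}\,I(\|\hat{\beta}_{jl}\| \neq 0),$$

where $X_{jl}$ is the sub-matrix corresponding to the $j$th gene of the $l$th subtype. 

Since $\hat{\beta}_{jl}'(\hat{\beta}_{jl} - \tilde{\beta}_{jl})/\|\hat{\beta}_{jl}\| = \|\hat{\beta}_{jl}\|I(j \in B_2)$, we have 

$$\begin{aligned}
\frac{1}{n}(Y - X\hat{\beta})'X(\hat{\beta} - \tilde{\beta}) &= \sum_{j \in B_2}\gamma c_j\lambda_n\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma-1}\sum_{l=1}^{M_j}\sqrt{d_{jl}}\,\|\hat{\beta}_{jl}\| \\
&= \gamma\lambda_n\sum_{j=1}^{J}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma-1}\sum_{l=1}^{M_j}\sqrt{d_{jl}}\,\|\hat{\beta}_{jl} - \tilde{\beta}_{jl}\|.
\end{aligned}$$

Since $\gamma b^{\gamma-1}(b - a) \le b^{\gamma} - a^{\gamma}$ for $0 \le a \le b$, for $j \le J_1$, we have 

$$\gamma\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma-1}\sum_{l=1}^{M_j}\sqrt{d_{jl}}\,\|\hat{\beta}_{jl} - \tilde{\beta}_{jl}\| \le \left(\sum_{l=1}^{M_j}\sqrt{d_{jl}}\,\|\hat{\beta}_{jl}\|\right)^{\gamma} - \left(\sum_{l=1}^{M_j}\sqrt{d_{jl}}\,\|\tilde{\beta}_{jl}\|\right)^{\gamma}.$$

Since $\sum_{l=1}^{M_j}\sqrt{d_{jl}}\,\|\tilde{\beta}_{jl}\| = 0$ for $j > J_1$, this implies that 

$$\frac{1}{n}(Y - X\hat{\beta})'X(\hat{\beta} - \tilde{\beta}) \le \lambda_n\sum_{j=1}^{J}c_j\left\{\left(\sum_{l=1}^{M_j}\sqrt{d_{jl}}\,\|\hat{\beta}_{jl}\|\right)^{\gamma} - \left(\sum_{l=1}^{M_j}\sqrt{d_{jl}}\,\|\tilde{\beta}_{jl}\|\right)^{\gamma}\right\} - (1-\gamma)\lambda_n\sum_{j=J_1+1}^{J}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma}. \quad (23)$$

By the definition of $\hat{\beta}$, we have 

$$\frac{1}{2n}\|Y - X\hat{\beta}\|^2 + \lambda_n\sum_{j=1}^{J}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma} \le \frac{1}{2n}\|Y - X\tilde{\beta}\|^2 + \lambda_n\sum_{j=1}^{J}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\tilde{\beta}_{jk}\|\right)^{\gamma}.$$

Hence, by (23), 

$$\begin{aligned}
(1-\gamma)\lambda_n\sum_{j=J_1+1}^{J}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma} &\le \frac{1}{2n}\|Y - X\tilde{\beta}\|^2 - \frac{1}{2n}\|Y - X\hat{\beta}\|^2 - \frac{1}{n}(Y - X\hat{\beta})'X(\hat{\beta} - \tilde{\beta}) \\
&= \frac{1}{2n}\|X(\hat{\beta} - \tilde{\beta})\|^2.
\end{aligned}$$

Thus, by SRC condition (A3), with $n^{-1}\|X(\hat{\beta} - \tilde{\beta})\|^2 \le c^*\|\hat{\beta} - \tilde{\beta}\|^2$, we have 

$$(1-\gamma)\lambda_n\sum_{j=J_1+1}^{J}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma} \le \frac{c^*}{2}\|\hat{\beta} - \tilde{\beta}\|^2 = \frac{c^*}{2}\|\hat{\beta}_{B_2}\|^2,$$

which implies, by Theorem 7.1 (ii), 

$$(1-\gamma)\lambda_n\sum_{j=J_1+1}^{J}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma} \le \frac{c^*}{2}\|\hat{\beta}_{B_2}\|^2 \le \frac{c^*}{2}\|\hat{\beta} - \beta_0\|^2 = O_p\left(\frac{\log d}{n}\right). \quad (24)$$

We still need to find a lower bound of $\sum_{j=J_1+1}^{J}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma}$. Since $c_j \ge 1$ by assumption, 

$$\sum_{j=J_1+1}^{J}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma} \ge \left(\sum_{j=J_1+1}^{J}\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\hat{\beta}_{jk}\|\right)^{\gamma} \ge \|\hat{\beta}_{B_2}\|^{\gamma}. \quad (25)$$

If $\|\hat{\beta}_{B_2}\| > 0$, the combination of (24) and (25) yields 

$$(1-\gamma)\lambda_n \le \frac{c^*}{2}\|\hat{\beta}_{B_2}\|^{2-\gamma} = O_p\left(\left(\frac{\log d}{n}\right)^{1-\gamma/2}\right).$$

Since $\lambda_n(\log d/n)^{\gamma/2-1} \to \infty$ by assumption, this implies that 

$$P(\|\hat{\beta}_{B_2}\| > 0) \le P\left\{\lambda_n\left(\frac{n}{\log d}\right)^{1-\gamma/2} \le O_p(1)\right\} \to 0.$$

(ii) We now prove the second part. Let $h_n = n^{-1/2}$ and define 

$$V_{1n}(u) = L_n(\beta_0 + h_n(u',0')') - L_n(\beta_0),$$

with $0$ being the zero vector of dimension $|B_2|$. By (i), the following holds with large probability: 

$$\hat{\beta} - \beta_0 = h_n(\hat{u}',0')', \quad \hat{u} = \arg\min\{V_{1n}(u) : u \in \mathbb{R}^{|B_1|}\}.$$

The function $V_{1n}(u)$, $u \in \mathbb{R}^{|B_1|}$, can be written as 

$$\begin{aligned}
V_{1n}(u) &= \left\{-2h_nu'X_1'\varepsilon + h_n^2u'X_1'X_1u\right\} + \lambda_n\sum_{j=1}^{J_1}c_j\left\{\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\beta_{0jk} + h_nu_{jk}\|\right)^{\gamma} - \left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\beta_{0jk}\|\right)^{\gamma}\right\} \\
&= T_{1n}(u) + T_{2n}(u).
\end{aligned}$$

For the first term, we have 

$$T_{1n}(u) \to_D -2u'Z + u'\Sigma_1u.$$

For the second term, 

$$T_{2n}(u) \to \gamma\lambda_0\sum_{j=1}^{J_1}c_j\left(\sum_{k=1}^{M_j}\sqrt{d_{jk}}\,\|\beta_{0jk}\|\right)^{\gamma-1}\sum_{k=1}^{M_j}\sqrt{d_{jk}}\left\{\frac{u_{jk}'\beta_{0jk}}{\|\beta_{0jk}\|}I(\|\beta_{0jk}\| \neq 0) + \|u_{jk}\|I(\|\beta_{0jk}\| = 0)\right\}.$$

□ 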

7.2 Additional numerical results 



Table 5: Simulation under the heterogeneity model: unmatching rate=25% and nonzero regression 
coefficients under case 2. In each cell, the first row is number of true positives (standard deviation), 
and the second row is model size (standard deviation). 

Correlation   GLasso        Proposed
                            γ = 0.5      γ = 0.7      γ = 0.9
AR ρ=0.2      3.8(2.1)      2.4(4.2)     4.5(4.6)     5.4(3.4)
              32.7(17.5)    3.5(6.2)     7.9(7.7)     18.0(9.9)
AR ρ=0.5      6.0(2.4)      9.4(4.1)     10.2(3.2)    10.3(2.5)
              36.7(16.1)    13.3(5.9)    14.7(4.7)    20.0(7.7)
AR ρ=0.8      8.0(2.4)      11.7(0.9)    11.8(0.6)    11.7(0.8)
              46.9(18.5)    15.3(2.3)    15.6(3.0)    19.4(6.2)
Banded 1      3.8(2.1)      2.9(4.5)     4.4(4.5)     5.8(3.4)
              31.8(17.0)    4.3(6.6)     7.0(7.0)     18.1(9.9)
Banded 2      5.2(2.4)      9.2(4.5)     9.8(3.7)     9.5(3.0)
              32.7(18.8)    13.2(6.3)    14.4(5.3)    20.8(7.7)
Banded 3      5.7(2.5)      9.4(4.1)     9.7(3.4)     10.4(2.3)
              35.1(17.5)    13.0(5.6)    14.7(5.2)    22.8(8.7)

Table 6: Simulation under the heterogeneity model: unmatching rate=50% and nonzero regression 
coefficients under case 2. In each cell, the first row is number of true positives (standard deviation), 
and the second row is model size (standard deviation). 

Correlation   GLasso        Proposed
                            γ = 0.5      γ = 0.7      γ = 0.9
AR ρ=0.2      3.7(2.0)      1.8(3.3)     3.8(3.8)     6.1(2.5)
              30.3(16.9)    2.8(5.4)     7.6(7.8)     21.4(8.2)
AR ρ=0.5      5.3(2.6)      8.0(3.8)     9.4(2.7)     9.5(1.7)
              33.5(20.1)    13.6(7.8)    17.4(6.3)    22.9(6.7)
AR ρ=0.8      8.2(2.3)      10.6(2.2)    10.9(1.0)    11.0(1.3)
              47.8(19.2)    18.1(5.5)    18.3(3.7)    21.8(6.6)
Banded 1      3.9(2.0)      1.9(3.0)     3.4(3.6)     5.8(3.1)
              31.2(18.5)    3.1(5.3)     6.3(7.0)     18.5(9.0)
Banded 2      5.1(2.4)      6.4(4.0)     8.6(3.0)     8.6(2.8)
              31.6(15.6)    10.4(7.4)    15.7(6.8)    21.7(8.8)
Banded 3      5.8(2.4)      8.3(3.5)     9.0(2.7)     9.4(2.1)
              36.2(18.5)    13.9(7.4)    15.6(5.4)    21.7(7.2)

Table 7: Simulation under the homogeneity model: nonzero regression coefficients under case 1. 
In each cell, the first row is number of true positives (standard deviation), and the second row is 
model size (standard deviation). 

Correlation   GLasso        Proposed
                            γ = 0.5      γ = 0.7      γ = 0.9
AR ρ=0.2      5.4(2.3)      9.5(4.0)     9.5(4.2)     9.1(3.8)
              37.6(17.2)    9.8(3.8)     10.5(4.3)    16.9(8.1)
AR ρ=0.5      7.1(2.7)      11.6(1.9)    11.6(1.6)    11.5(1.7)
              41.8(21.5)    11.8(1.4)    11.8(1.3)    15.2(4.4)
AR ρ=0.8      9.7(2.4)      11.9(0.4)    12.0(0.0)    12.0(0.0)
              48.3(17.8)    12.0(0.4)    12.1(0.6)    14.0(2.1)
Banded 1      5.1(2.0)      9.9(4.1)     10.1(3.8)    9.9(3.2)
              33.8(16.2)    10.2(3.9)    11.2(3.6)    17.9(6.9)
Banded 2      7.5(2.7)      12.0(0.0)    12.0(0.0)    11.6(1.1)
              45.5(19.3)    12.0(0.0)    12.3(1.1)    17.1(5.5)
Banded 3      8.3(2.3)      11.8(0.9)    11.9(0.8)    11.9(0.5)
              50.5(18.5)    11.8(0.9)    12.0(0.9)    15.9(3.7)

Table 8: Simulation under the homogeneity model: nonzero regression coefficients under case 2. 
In each cell, the first row is number of true positives (standard deviation), and the second row is 
model size (standard deviation). 

Correlation   GLasso        Proposed
                            γ = 0.5      γ = 0.7      γ = 0.9
AR ρ=0.2      4.1(2.4)      6.0(5.1)     6.9(4.8)     7.9(3.6)
              33.0(18.8)    6.2(5.1)     8.9(5.8)     19.9(7.7)
AR ρ=0.5      6.5(2.5)      11.2(2.3)    11.4(1.7)    11.3(1.6)
              39.2(18.2)    11.6(2.1)    12.4(2.0)    18.3(6.5)
AR ρ=0.8      8.2(2.6)      11.9(0.4)    11.9(0.4)    11.9(0.6)
              46.9(21.9)    12.0(0.6)    12.4(1.5)    16.4(5.8)
Banded 1      4.0(2.2)      5.3(5.3)     7.0(4.7)     7.8(3.5)
              33.5(16.2)    5.5(5.3)     8.1(5.3)     19.6(7.9)
Banded 2      6.0(2.2)      10.6(3.6)    11.1(2.5)    11.2(1.8)
              38.1(16.8)    10.9(3.2)    12.8(3.6)    19.1(6.2)
Banded 3      6.5(2.1)      11.4(1.9)    11.2(2.0)    11.3(2.1)
              39.9(16.8)    11.7(1.3)    12.5(2.1)    19.3(5.7)





Table 9: SNP-level estimates using the proposed approach: DLBCL. 

SNP          Est        SNP        Est        SNP        Est        SNP          Est
ALOX15B.01    0.00847   LMAN1.02   -0.00386   MCP.02     -0.00366   NCF4.42      -0.00429
ALOX15B.03    0.00238   LMAN1.04   -0.00017   MCP.04      0.01080   NCF4.43      -0.00424
ALOX15B.04    0.00397   LMAN1.05    0.00876   MCP.05     -0.00116   NCF4.44       0.00086
ALOX15B.06    0.00490   LMAN1.06   -0.00343   MCP.06      0.00242   NCF4.45      -0.00098
ALOX15B.07    0.00288   LMAN1.07    0.00562   MCP.07      0.00031   NCF4.46      -0.00055
ALOX5.01      0.00357   LMAN1.08    0.00562   MCP.08     -0.00511   NCF4.49       0.00130
ALOX5.41      0.00559   LMAN1.09    0.00645   MEFV.01     0.00544   SERPINB3.01  -0.00410
ALOX5.42      0.00033   MBP.02     -0.00334   MEFV.02     0.01260   SERPINB3.02   0.00262
ALOX5.43     -0.00611   MBP.03      0.00084   MEFV.03    -0.00468   SERPINB3.05   0.00489
ALOX5.44      0.00825   MBP.04      0.00013   MEFV.04     0.00699   SERPINB3.06   0.00177
ALOX5.45      0.00357   MBP.05      0.00058   MEFV.05    -0.00081   STAT4.08     -0.00873
ALOX5.46     -0.00093   MBP.06     -0.00052   MIF.01      0.00760   STAT4.09      0.00045
ALOX5.47      0.00428   MBP.07      0.00058   MIF_01_2    0.00760   STAT4.10      0.00004
ALOX5.48     -0.00018   MBP.08      0.00171   MIF.14     -0.00358   STAT4.11      0.00485
ALOX5.49      0.00119   MBP.09     -0.00075   MIF.15     -0.00805   STAT4.12     -0.00138
ALOX5.51      0.00556   MBP.10     -0.00100   MIF.16     -0.00861   STAT4.13     -0.00156
ALOX5.52      0.00349   MBP.12     -0.00149   MIF.18      0.00750   STAT4.14      0.00097
ALOX5.53      0.00113   MBP.13      0.00118   MIF.19      0.00431   STAT4.16     -0.00500
ALOX5.54      0.01233   MBP.14      0.00006   MIF.20      0.00473   STAT4.17     -0.00605
ALOX5.55      0.00042   MBP.15      0.00003   MIF.21      0.00865   STAT4.18      0.00191
CSF2.02       0.01655   MBP.16      0.00123   MIF.22     -0.00365   STAT4.19     -0.00114
DEFB1.01     -0.00675   MBP.17      0.00021   MIF.23     -0.00232   STAT4.21      0.00118
DEFB1.02     -0.01199   MBP.18      0.00032   MUC6_01    -0.00386   STAT4.23     -0.00751
DEFB1.03     -0.00798   MBP.19     -0.00094   MUC6_02    -0.00825   STAT4.24      0.00048
DEFB1.04      0.00948   MBP.20     -0.00028   MUC6_03    -0.00041   STAT4.25      0.00687
DEFB1.05     -0.01192   MBP.21     -0.00084   MUC6_04    -0.00145   STAT4.29      0.00242
DEFB1.06     -0.00488   MBP.22     -0.00023   MUC6_07    -0.00329   STAT4.30      0.00274
DEFB1.07     -0.00561   MBP.23     -0.00159   MUC6_08    -0.00569   STAT4.31     -0.00531
DEFB1.08     -0.00195   MBP.24     -0.00097   MUC6_09    -0.00362   STAT4.33     -0.00654
DEFB1.09      0.00809   MBP.25      0.00063   MUC6.10    -0.01415   STAT4.34      0.00814
DEFB1.11     -0.00332   MBP.26      0.00116   MUC6.13     0.01833   STAT4.35      0.00020
DEFB1.12     -0.00140   MBP.27     -0.00045   MUC6.14    -0.01229   STAT4.36      0.00369
DEFB1.13     -0.01077   MBP.28      0.00022   NCF4.12    -0.00118   STAT4.37      0.00462
IL10.01      -0.00009   MBP.29     -0.00214   NCF4.18    -0.00292   STAT4.38      0.00229
IL10.02      -0.00009   MBP.30      0.00030   NCF4.33    -0.00254   STAT4.39     -0.00357
IL10.03       0.00000   MBP.31     -0.00031   NCF4.34    -0.00077   STAT4.41      0.00203
IL10.06       0.00001   MBP.32     -0.00140   NCF4.35    -0.00932   STAT4.42      0.00206
IL10.07       0.00001   MBP.33     -0.00060   NCF4.36    -0.00220   STAT4.43      0.00346
IL10.17       0.00000   MBP.34     -0.00040   NCF4.37    -0.00483   STAT4.44      0.00711
IL10_17_2     0.00000   MBP.35      0.00255   NCF4.38    -0.00229   STAT4.45      0.00061
LMAN1.01      0.00918   MBP.36     -0.00051   NCF4.39     0.00575   STAT4.46      0.00687

Table 10: SNP-level estimates using the proposed approach: FL. 

SNP          Est        SNP        Est        SNP        Est        SNP          Est
ALOX12.06     0.01078   IRAK2.20    0.00404   MBP.23      0.00021   NCF4.46       0.00449
ALOX12.07     0.01315   IRAK2.21    0.00257   MBP.24     -0.00002   NCF4.49       0.00383
ALOX12.09     0.00692   IRAK2.22    0.00315   MBP.25      0.00075   PTK9L.01      0.00275
CLCA1.01     -0.00360   IRAK2.23    0.00319   MBP.26     -0.00051   PTK9L.02      0.01068
CLCA1.02      0.00514   IRAK2.24   -0.00416   MBP.27     -0.00013   PTK9L.03      0.00498
CLCA1_03     -0.00255   IRAK2.25   -0.00099   MBP.28      0.00049   SOD3.27       0.00399
CLCA1.04     -0.00307   IRAK2.26   -0.00563   MBP.29     -0.00156   STAT4.08     -0.00143
CLCA1_05      0.00312   IRAK2.27    0.00227   MBP.30      0.00080   STAT4.09      0.00123
CLCA1.06     -0.00192   IRAK2.28    0.00690   MBP.31     -0.00148   STAT4.10      0.00068
CLCA1_07      0.00004   LIG4.02    -0.01321   MBP.32      0.00037   STAT4.11     -0.00025
CLCA1_08      0.00606   LMAN1.01   -0.00644   MBP.33     -0.00061   STAT4.12      0.00048
CLCA1_09     -0.00514   LMAN1.02    0.00382   MBP.34     -0.00097   STAT4.13      0.00096
CLCA1.10      0.00190   LMAN1.04   -0.00467   MBP.35     -0.00030   STAT4.14      0.00050
CLCA1.11     -0.00208   LMAN1.05    0.00560   MBP.36      0.00067   STAT4.16      0.00027
CLCA1.12      0.00447   LMAN1.06    0.00355   MIF.01     -0.00031   STAT4.17     -0.00088
CLCA1.13      0.00266   LMAN1.07   -0.01116   MIF_01_2   -0.00031   STAT4.18      0.00121
CLCA1.15     -0.00221   LMAN1.08   -0.01116   MIF.14      0.00018   STAT4.19      0.00057
CLCA1.16      0.00426   LMAN1.09   -0.00280   MIF.15      0.00047   STAT4.21      0.00074
CLCA1.17     -0.00246   MBP.02      0.00000   MIF.16      0.00045   STAT4.23     -0.00026
CLCA1.18     -0.00427   MBP.03      0.00016   MIF.18     -0.00030   STAT4.24      0.00117
CLCA1.19     -0.00243   MBP.04      0.00065   MIF.19      0.00009   STAT4.25     -0.00119
CLCA1.20      0.00698   MBP.05      0.00047   MIF.20      0.00015   STAT4.29      0.00201
CLCA1.21      0.00241   MBP.06     -0.00001   MIF.21     -0.00031   STAT4.30     -0.00002
CLCA1.22      0.00865   MBP.07      0.00016   MIF.22     -0.00049   STAT4.31     -0.00118
CLCA1.23     -0.00595   MBP.08      0.00008   MIF.23      0.00021   STAT4.33     -0.00121
IL17C.01      0.01724   MBP.09      0.00124   NCF4.12    -0.00300   STAT4.34     -0.00174
IRAK2.01     -0.00111   MBP.10      0.00053   NCF4.18     0.00226   STAT4.35      0.00183
IRAK2.02     -0.00289   MBP.12     -0.00062   NCF4.33    -0.00048   STAT4.36      0.00046
IRAK2.10      0.00704   MBP.13      0.00084   NCF4.34    -0.00323   STAT4.37     -0.00041
IRAK2.11      0.00483   MBP.14     -0.00016   NCF4_35    -0.00107   STAT4.38      0.00059
IRAK2.12     -0.00195   MBP.15     -0.00066   NCF4_36    -0.00582   STAT4.39      0.00205
IRAK2.13     -0.00178   MBP.16      0.00070   NCF4_37     0.00349   STAT4.41      0.00105
IRAK2.14     -0.00234   MBP.17      0.00017   NCF4.38    -0.00065   STAT4.42      0.00011
IRAK2.15      0.00474   MBP.18     -0.00010   NCF4.39     0.00466   STAT4.43      0.00055
IRAK2.16     -0.00089   MBP.19     -0.00121   NCF4.42     0.00165   STAT4.44      0.00148
IRAK2.17     -0.00261   MBP.20      0.00020   NCF4.43     0.00013   STAT4.45      0.00062
IRAK2.18     -0.00047   MBP.21      0.00101   NCF4.44    -0.00265   STAT4.46     -0.00135
IRAK2.19     -0.00278   MBP.22     -0.00004   NCF4.45    -0.00420

Table 11: SNP-level estimates using the proposed approach: CLL/SLL. 

SNP          Est        SNP        Est        SNP        Est        SNP          Est
ALOX5.01      0.00540   IRAK2.13   -0.00296   MBP.14     -0.00157   STAT4.12     -0.00348
ALOX5.41     -0.00105   IRAK2.14   -0.00353   MBP.15     -0.01136   STAT4.13     -0.00234
ALOX5.42      0.00280   IRAK2.15    0.00202   MBP.16     -0.00984   STAT4.14     -0.00177
ALOX5.43      0.00393   IRAK2.16   -0.00217   MBP.17      0.00380   STAT4.16      0.00067
ALOX5.44      0.00407   IRAK2.17    0.00271   MBP.18     -0.01136   STAT4.17     -0.00307
ALOX5.45      0.00504   IRAK2.18    0.00186   MBP.19     -0.00883   STAT4.18     -0.00050
ALOX5.46     -0.00096   IRAK2.19    0.00266   MBP.20     -0.00373   STAT4.19     -0.00249
ALOX5.47      0.00234   IRAK2.20   -0.00165   MBP.21      0.00081   STAT4.21     -0.00214
ALOX5.48      0.00200   IRAK2.21   -0.00307   MBP.22     -0.00438   STAT4.23     -0.00306
ALOX5.49      0.00334   IRAK2.22   -0.00025   MBP.23      0.00720   STAT4.24     -0.00144
ALOX5.51     -0.00205   IRAK2.23    0.00186   MBP.24      0.00087   STAT4.25      0.00143
ALOX5.52     -0.00043   IRAK2.24    0.00226   MBP.25     -0.00234   STAT4.29     -0.00179
ALOX5.53      0.00556   IRAK2.25   -0.00147   MBP.26      0.00109   STAT4.30     -0.00187
ALOX5.54      0.00552   IRAK2.26   -0.00364   MBP.27      0.00462   STAT4.31     -0.00045
ALOX5.55     -0.00058   IRAK2.27    0.00008   MBP.28      0.00801   STAT4.33     -0.00056
IL10.01      -0.00305   IRAK2.28   -0.00263   MBP.29     -0.00328   STAT4.34     -0.00119
IL10.02      -0.00421   MBP.02      0.00696   MBP.30      0.00278   STAT4.35      0.00023
IL10.03      -0.00313   MBP.03      0.01177   MBP.31     -0.00510   STAT4.36     -0.00246
IL10.06      -0.00212   MBP.04      0.01112   MBP.32      0.00658   STAT4.37      0.00073
IL10.07      -0.00313   MBP.05      0.00279   MBP.33     -0.00924   STAT4.38     -0.00174
IL10.17      -0.01264   MBP.06      0.01161   MBP.34      0.00411   STAT4.39     -0.00032


IL10_17_2 


-0.01264 


MBP.07 


-0.00203 


MBP.35 


0.00028 


STAT4.41 


-0.00224 


IRAK2.01 


0.00421 


MBP.08 


-0.00272 


MBP.36 


0.00618 


STAT4.42 


-0.00210 


IRAK2.02 


0.00072 


MBP.09 


0.00394 


STAT4.08 


-0.00153 


STAT4.43 


-0.00111 


IRAK2.10 


-0.00098 


MBP.10 


-0.00098 


STAT4.09 


-0.00124 


STAT4.44 


-0.00281 


IRAK2.11 


-0.00096 


MBP.12 


0.00964 


STAT4.10 


-0.00123 


STAT4.45 


0.00067 


IRAK2.12 


0.00107 


MBP.13 


0.00312 


STAT4.11 


-0.00205 


STAT4.46 


0.00143 
Table 12: Analysis of the NHL data using group Lasso for each subtype separately: L2-norm of the estimate for a specific gene; OOI: observed occurrence index.

Gene      DLBCL           FL              CLL/SLL
          L2-norm  OOI    L2-norm  OOI    L2-norm  OOI
AHR                       0.017    0.97
ALOX12                    0.019    1.00
ALOX15B   0.012    1.00
APEX1     0.000    0.33
BAT5                      0.003    0.33
BHMT      0.003    0.74   0.001    0.28
C1QA      0.010    0.91
C1QB                      0.019    0.90
C1QG      0.003    0.46
C1S                       0.004    0.56
C2                                        0.014    0.90
C4BPA     0.001    0.25
C5        0.010    0.93
C8A                       3E-04    0.36
CASP10    0.010    0.97
CCL13     0.004    0.84
CCND1                     0.003    0.46
CCR1                      0.008    0.79
CCR8                                      0.018    0.76
CENTA1                                    0.008    0.66
CSF2      0.006    0.88                   0.006    0.69
CSF3      0.011    0.94
CTNNB1                    0.004    0.54
CX3CR1    0.006    0.81
CYP1A2                    0.002    0.51
CYP1B1    0.012    0.91   0.001    0.23
CYP2C9                    0.017    0.90
DEF6                      0.006    0.69
DEFB1     0.002    0.70
DHX33     0.015    1.00
EPHX1     0.007    0.88   0.006    0.67
ERCC2     0.010    0.97
ERCC5                     0.003    0.18
GGH                       0.001    0.31   0.003    0.45
ICAM2     0.009    0.90
ICAM4     0.004    0.86
ICAM5                     0.011    0.95
IFNGR1                                    0.002    0.38
IL10      0.002    0.61                   0.011    0.86
IL10RA    0.005    0.84
IL15      0.007    0.96                   0.005    0.90
IL15RA                    0.002    0.54
IL17C                     0.013    0.92
IL3       0.004    0.84
KLK6                                      0.006    0.38
LEPR      0.007    0.90
LIG4                                      0.000    0.17
LMAN1                     0.002    0.41
LPO       0.005    0.61
MASP2                     0.004    0.62
MCP       0.011    0.90
MEFV      4E-04    0.38
MIF       0.002    0.43
MLH1      0.001    0.54
MTHFD2    0.004    0.52
MTR                                       0.001    0.28
MTRR      0.014    1.00
MUC6      0.008    0.96
MUC7                      0.005    0.77
MYC                                       0.006    0.86
T2                                        0.001    0.34
NBS1      0.001    0.65
NCF4                      0.001    0.18
NFKBIE    0.007    0.94                   0.003    0.41
NOS3      0.006    0.84
OGG1      0.003    0.64   0.010    0.72
PDCD10    0.001    0.52
PFKFB2    0.005    0.87
PPARG                     0.005    0.38
PRO1580                                   0.005    0.55
PTK9      0.011    1.00
PTK9L                     0.018    0.92
RAD23B    0.019    1.00
RAG1      0.004    0.74
SECTM1    0.001    0.54
SELE      0.002    0.75                   0.001    0.45
SENP3     0.004    0.75
SERPINB3  0.004    0.84   0.003    0.26
SOCS4     0.009    0.93
SOD3      3E-04    0.28   0.008    0.85
STAT5B                                    0.002    0.52
STAT6     0.002    0.57
STK11                                     0.002    0.48
TCN1      0.008    0.94
TICAM1    0.001    0.59
TLR1                                      0.019    0.90
TLR9      0.002    0.42
TNFRSF18                  0.003    0.67
TOLLIP    8E-05    0.29   0.005    0.54
TP53                      0.001    0.46
TYK2      0.003    0.81
WDHD1     0.015    1.00
WRN                       0.003    0.49
XRCC1     0.004    0.70                   9E-05    0.28
XRCC3     4E-04    0.39
XRCC4                                     0.010    0.76
ZNF76                                     0.006    0.66
ZP1                                       0.006    0.76
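
The two gene-level summaries reported in Table 12 can be illustrated with a minimal Python sketch. It assumes that the L2-norm is the Euclidean norm of the SNP-level estimates within a gene, and that the OOI is the fraction of resampling runs in which a gene is selected. The function names and the four resampling runs are hypothetical, and the IL10 estimates are the CLL/SLL values from Table 11, used only as illustrative input.

```python
import math
from collections import Counter


def gene_l2_norm(snp_estimates):
    """Return the Euclidean (L2) norm of one gene's SNP-level estimates."""
    return math.sqrt(sum(b * b for b in snp_estimates))


def occurrence_index(selected_per_run):
    """Return, for each gene, the fraction of runs in which it was selected.

    ASSUMPTION: this stands in for the OOI; the resampling scheme that
    produces the runs is not specified here.
    """
    n_runs = len(selected_per_run)
    # set(run) guards against a gene being listed twice within one run.
    counts = Counter(g for run in selected_per_run for g in set(run))
    return {gene: count / n_runs for gene, count in counts.items()}


# SNP-level estimates for IL10 in CLL/SLL, copied from Table 11.
il10_estimates = [-0.00305, -0.00421, -0.00313, -0.00212,
                  -0.00313, -0.01264, -0.01264]
il10_norm = gene_l2_norm(il10_estimates)

# Hypothetical selection results from four resampling runs (made-up data).
runs = [["IL10", "MBP"], ["IL10"], ["MBP", "STAT4"], ["IL10", "MBP"]]
ooi = occurrence_index(runs)

print(f"IL10 L2-norm: {il10_norm:.5f}")
print({gene: round(value, 2) for gene, value in sorted(ooi.items())})
```

Under these assumptions, a gene with a large L2-norm but a low OOI would be one whose estimated effect is sizable in the full data yet unstable across resamples.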